Will P2 be able to do professional Voice Identification?
william chan
So now the P2 will have ADC pins, so it will be super easy to connect microphones to any pin.
My question: is the P2 powerful enough to identify a person based on his/her voice?
Comments
However, you don't need an ADC to interface digital MEMS microphones, just a clock and a data line. I would have thought an ARM chip more suitable for this sort of thing, though, since it may need quite a lot of memory.
Not that I have any knowledge on best methods but I don't see why a recognition signature would need to be bigger than 1 KB. It only needs a true/false decision made.
Would it be better to store the voice sample in PDM or PCM for fast identification processing later?
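For what it's worth, turning PDM into PCM before storage is cheap enough that the choice may not matter much. A minimal sketch, assuming a crude 64:1 boxcar decimation (a real front end would use a CIC or FIR decimator, and the function name here is made up):

```c
// Minimal sketch (not P2-specific): crude PDM-to-PCM decimation by
// counting set bits over a 64-bit window. A real design would use a
// CIC or FIR decimator for better quality.
#include <stdint.h>

// Convert one 64-bit PDM word (64 one-bit samples) into a single
// signed 16-bit PCM sample centred on zero.
int16_t pdm64_to_pcm(uint64_t pdm_bits)
{
    int ones = 0;
    for (int i = 0; i < 64; i++)
        ones += (pdm_bits >> i) & 1u;

    // 0..64 ones maps to roughly -32736..+32736
    return (int16_t)((ones - 32) * 1023);
}
```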
I think our Alexa uploads sounds to the cloud, right?
Maybe that's a better solution needing less MPU power?
Still, Phil Pilgrim seems to have made one that worked OK even on the P1.
So, maybe easy to improve with P2.
I think someone here will find a way to do it.
It seems, though, that the recent growth of such capabilities is enabled by neural networks, which are "trained" to perform decision-making tasks, instead of "programmed" to. They don't work in an exact manner, like how a program executes. At the bottom-most level, each neural node makes sense, but nobody can make any high-level determination from that level, because the purposeful behavior emerges from too many such sub-parts to get one's head around.
Neural nets have made lots of neat things possible, but they trade exactness (which is hard to program, maybe too discriminatory) for probability (a child could train it, but it will be as errant as a child).
If you could figure out what it is about people's voices that differentiate them from each other, you could write an algorithm on the P2 to perform the discrimination, and then weigh out your determinations to arrive at a result.
Neural nets are being applied to a lot of things these days, like spell checking and predicting the end of what someone is typing into a text window. I see these things making obvious mistakes, like capitalizing letters on words because they've seen those words appear in their training data sets as parts of proper nouns. My point is, they know nothing about English, exactly, just statistical likelihoods of one thing following another.
Still, good hobby level fun, I would think.
I just wanted to add something about the neural nets. Big, complicated ones and their training are resource intensive, but once they are created, they can be pruned down some. And sometimes pruned down much more than one would think.
A while back, I was reading a paper on the numeric fidelity a net needs in order to perform. A couple of cases were shown where a robust net, say 8 bits per node, was scaled down to as little as one bit per node while retaining useful functionality!
This is being done in the mobile space. When possible, and to reduce latency and the need to exchange data with servers, parts of features are slimmed down and then optimized to run right on the mobile device.
I was in the EU a while back and got stuck with some terrible 100 MB data plan. That didn't last long.
When using Maps, I got a notice that I was running with reduced functionality. Prior to my trip, I had downloaded a cache of the region I would be in.
The reduced version omitted some place names and did not route as well as the full, big-brother edition of Maps would, but it was more than enough. Enough that I got the feeling most of the value is in the data, both cached at the data center and the real-time, or near-real-time, data collected by other users in my region.
It may be we can build nets with some of the cool tools out there now, then bit-reduce them to run just fine on a P2, perhaps stashing the data in a big external RAM or maybe even an SD card.
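That bit-reduce step can start as simple post-training quantization. A minimal sketch, assuming a single per-layer scale factor and int8 targets (array sizes and names are illustrative only); the same idea extends, with more care, toward the one-bit case mentioned above:

```c
// Minimal sketch of the "bit reduce" idea: post-training quantization
// of a trained layer's float weights down to int8, with the scale
// needed to map them back to real values.
#include <math.h>
#include <stdint.h>

// Quantize n float weights into int8 using one per-layer scale.
// Returns the scale, so that: real weight ~= q[i] * scale.
float quantize_weights_int8(const float *w, int8_t *q, int n)
{
    float max_abs = 0.0f;
    for (int i = 0; i < n; i++)
        if (fabsf(w[i]) > max_abs)
            max_abs = fabsf(w[i]);

    float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    for (int i = 0; i < n; i++)
        q[i] = (int8_t)lrintf(w[i] / scale);   // round to nearest int8 step

    return scale;
}
```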
a. Enroll a person's coughs by storing the FFT peaks at the loudest point and link to his user ID.
b. Use a proximity sensor to detect when a person is in front of the cough reader and when he has left.
c. Each time a user coughs but does not match, the mismatched FFT signature is stored in temporary storage.
d. Once a cough matches, all the previous mismatched FFT signatures will be enrolled as that user's additional cough templates.
e. However, if any new template happens to match another user's template, the new template will be discarded.
f. If the proximity sensor detects that a person has left without a successful match, the temporary templates will be deleted.
This allows the cough reader to learn all the possible coughs of a person.
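For illustration, here is a minimal sketch in C of the enrollment flow in steps (a) through (f). The signature layout, table sizes and the match() test are assumptions, not a finished design:

```c
// Sketch of the cough enrollment flow described in steps (a)-(f).
#include <stdbool.h>

#define PEAKS        8     /* FFT peaks kept per cough            */
#define MAX_TEMPL    16    /* templates stored per user           */
#define MAX_PENDING  8     /* unmatched coughs held while present */

typedef struct { float peak_freq[PEAKS]; } Signature;

typedef struct {
    int       user_id;
    int       count;
    Signature templ[MAX_TEMPL];
} User;

static Signature pending[MAX_PENDING];
static int       pending_count = 0;

// Assumed helper: true if two signatures are "close enough".
bool match(const Signature *a, const Signature *b);

// Called for every detected cough while the proximity sensor is active.
// Returns the matched user_id, or -1 if no user has matched yet.
int on_cough(const Signature *sig, User *users, int n_users)
{
    for (int u = 0; u < n_users; u++) {
        for (int t = 0; t < users[u].count; t++) {
            if (!match(sig, &users[u].templ[t]))
                continue;

            // (d) a match: enroll all pending signatures for this user,
            // (e) unless a pending one also matches another user.
            for (int p = 0; p < pending_count; p++) {
                bool clash = false;
                for (int v = 0; v < n_users && !clash; v++)
                    if (v != u)
                        for (int s = 0; s < users[v].count; s++)
                            if (match(&pending[p], &users[v].templ[s]))
                                clash = true;
                if (!clash && users[u].count < MAX_TEMPL)
                    users[u].templ[users[u].count++] = pending[p];
            }
            pending_count = 0;
            return users[u].user_id;
        }
    }

    // (c) no match: park this signature until the person matches or leaves.
    if (pending_count < MAX_PENDING)
        pending[pending_count++] = *sig;
    return -1;
}

// (f) Called when the proximity sensor reports the person has left.
void on_person_left(void) { pending_count = 0; }
```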
It seems your topic should have been "cough identification" rather than "Voice Identification". FFTs are well suited to processors with FPUs, such as M4 or M7 ARM chips, whereas the P2 can do FFTs, but not as well.
So let's say that to recognize a person from the cough sound, we need to find the amplitude of frequencies from 0 Hz to 20 kHz, divided into 40 discrete frequency bins of 500 Hz each (is 500 Hz too wide?).
So we prepare 40 accumulators, each accumulating the product of the sine and cosine of its frequency with the actual sample amplitudes at every 90 degrees of phase.
Finally, we sort the accumulators from highest to lowest value to get the bio-signature.
Sounds simple for the P2, right?
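For illustration, a minimal sketch of that 40-bin idea in C, assuming a 44.1 kHz sample rate and correlating against every sample rather than only the 90-degree phase points; a real P2 version would use fixed point rather than floats:

```c
// Sketch: correlate the samples against sine and cosine at each of
// 40 centre frequencies (500 Hz steps), then rank the bins by energy.
#include <stdint.h>
#include <stdlib.h>
#include <math.h>

#define BINS      40
#define BIN_HZ    500.0
#define SAMPLE_HZ 44100.0

static const double TWO_PI = 6.283185307179586;

typedef struct { int bin; double energy; } BinEnergy;

static int by_energy_desc(const void *a, const void *b)
{
    double d = ((const BinEnergy *)b)->energy - ((const BinEnergy *)a)->energy;
    return (d > 0) - (d < 0);
}

// Fill sig[] with the 40 bin indices ordered loudest-first.
void make_signature(const int16_t *samples, int n, int sig[BINS])
{
    BinEnergy e[BINS];

    for (int k = 0; k < BINS; k++) {
        double f = (k + 1) * BIN_HZ;        /* 500 Hz .. 20 kHz */
        double sin_acc = 0.0, cos_acc = 0.0;

        for (int i = 0; i < n; i++) {
            double phase = TWO_PI * f * i / SAMPLE_HZ;
            sin_acc += samples[i] * sin(phase);
            cos_acc += samples[i] * cos(phase);
        }
        e[k].bin    = k;
        e[k].energy = sin_acc * sin_acc + cos_acc * cos_acc;
    }

    qsort(e, BINS, sizeof e[0], by_energy_desc);
    for (int k = 0; k < BINS; k++)
        sig[k] = e[k].bin;
}
```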
This thread
https://forums.parallax.com/discussion/166671/p2-interpreter-comparison
already has some FFT benchmarks
My FFT comes in Spin and C versions and runs fine on the P1; google "heater_fft" and "fft_bench". The guts of the Spin version are written in assembler, but I include the PASM as part of the Spin since it's all integrated into the same source file.
The C version of heater_fft can spread the workload over 2 or 4 COGs automatically for extra performance.
If you want to do an FFT covering 0 Hz to 20 kHz you will need 40 thousand samples, and hence that many frequency bins. That's OK, the P2 has the space. But for speech, 8 or 10 K samples per second would do. And of course you need not use a whole second's worth of samples at a time. Perhaps only 100 ms worth would do.
The problem I had was running out of bits. The heater_FFT uses 32 bit fixed point arithmetic with 12 bits right of the binary point. For large sample sets you end up adding together a lot of samples and overflowing the number range. Hence I only do an FFT over 1024 samples. Or maybe I was just looking at the problem wrongly.
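One common workaround for that overflow, for what it's worth, is to scale each butterfly result down by one bit per FFT pass, so an N-point transform only grows the word length by one bit per stage. A sketch of one scaled radix-2 butterfly, assuming the same 32-bit format with 12 fractional bits (this is a generic illustration, not how heater_fft is actually written):

```c
// One radix-2 butterfly on 32-bit fixed-point values (Q19.12),
// with a 1-bit downscale applied to both outputs ("block scaling").
#include <stdint.h>

static inline void butterfly_scaled(int32_t *a_re, int32_t *a_im,
                                    int32_t *b_re, int32_t *b_im,
                                    int32_t w_re, int32_t w_im)
{
    // Twiddle multiply (b * w) in Q19.12; products taken in 64 bits.
    int32_t t_re = (int32_t)(((int64_t)*b_re * w_re - (int64_t)*b_im * w_im) >> 12);
    int32_t t_im = (int32_t)(((int64_t)*b_re * w_im + (int64_t)*b_im * w_re) >> 12);

    int32_t sum_re  = *a_re + t_re,  sum_im  = *a_im + t_im;
    int32_t diff_re = *a_re - t_re,  diff_im = *a_im - t_im;

    // The >> 1 is the per-stage scaling that keeps log2(N) passes
    // from overflowing the 32-bit accumulators.
    *a_re = sum_re  >> 1;  *a_im = sum_im  >> 1;
    *b_re = diff_re >> 1;  *b_im = diff_im >> 1;
}
```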
I'm not convinced that taking just the FFT is enough to recognize a voice.
How would one do the signature matching?
1) Take the FFT.
2) Normalize it so that the total area under the curve is 1.
3) Calculate the root-mean-squared difference between it and a normalized FFT of the reference voice.
See here for solutions like that, and others, https://www.researchgate.net/post/How_can_I_compare_the_shape_of_two_curves
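A minimal sketch of steps 1 to 3 in C (the bin count here is just an assumption):

```c
// Normalise each magnitude spectrum so it sums to 1, then take the
// RMS difference between the candidate and a stored reference.
#include <math.h>

#define N_BINS 512

// Scale the spectrum in place so the total "area" (sum of bins) is 1.
void normalize(float spec[N_BINS])
{
    float sum = 0.0f;
    for (int i = 0; i < N_BINS; i++)
        sum += spec[i];
    if (sum > 0.0f)
        for (int i = 0; i < N_BINS; i++)
            spec[i] /= sum;
}

// Root-mean-square difference between two normalised spectra;
// smaller means a closer match to the reference voice.
float rms_difference(const float a[N_BINS], const float b[N_BINS])
{
    float acc = 0.0f;
    for (int i = 0; i < N_BINS; i++) {
        float d = a[i] - b[i];
        acc += d * d;
    }
    return sqrtf(acc / N_BINS);
}
```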
However, it doesn't need training to recognize what you are saying.
I think you might be confusing the Fast Fourier Transform with an Artificial Neural Network. I would think an FFT will always be the first stage in building a compact signature.
Oddly enough all the examples in the benchmarks are interpreted at runtime from some kind of non-native instructions.
I thought I had seen a native (compiled to PASM) benchmark go past, but cannot find it easily...
Is there anything that compiles to native P2 HUB code? I don't recall seeing it go by.
Hmm... actually, the FFT above can be compiled with fcache, which means its inner loops are compiled to native P1 code that runs directly in the COG, even if the rest of the code is LMM.
and to the Discrete Cosine Transform for image compression [2].
[1] https://scialert.net/fulltext/?doi=jas.2013.465.471
[2] https://spectrum.library.concordia.ca/976081/
If you happen to have the code to either of those, in C, Spin or PASM, that would be great.
I discovered the DTT yesterday!