SPeechIN: (WiP) using the Poor Man's DCT - need Testers!
lonesock
Posts: 917
Hi, All.
I've been intrigued by the challenge of getting simple speech recognition working on the prop (especially fitting it into Hanno's additional restrictions). I had an idea about this, hence this post 8^)
Here are the basic concepts:
* The Walsh Transform converts a time series to the frequency domain (sort of), using only additions and subtractions (The Poor Man's DCT)
* The Walsh Transform Matrix is moderately difficult to generate on the fly
* Here's the new idea: you can approximate the Walsh Transform Matrix by using Bresenham's Line Algorithm, and I'm calling this the Pseudo-Walsh Transform. I haven't seen any literature on this, so as far as I know it's new 8^)
* The transform matrix is symmetric, so instead of waiting to capture N samples then performing an O(N*N) transform, you can accumulate the contribution to each "bucket" one sample at a time.
That's the basic idea. I'll be working on it as time permits, and provide some source code and such when I'm done. If anyone has any comments or suggestions, I would very much welcome them. Attached is some simple C++ code to generate a bunch of these transform matrices, so I could test them in Octave (or MATLAB if you have it).
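For anyone curious what the Bresenham trick might look like, here's a minimal C sketch (my own reconstruction, not the attached generator code): row k of an N x N pseudo-Walsh matrix is a +/-1 square wave with k sign flips, spread as evenly as integer math allows by the usual Bresenham error accumulator.

```c
/* Sketch (reconstruction, not the attached generator): build row k of
   an N x N pseudo-Walsh matrix. The error accumulator is exactly
   Bresenham's: advance with slope k/n, toggle the sign on overflow,
   using only integer adds and compares. */
void pseudo_walsh_row(int k, int n, int *row)
{
    int sign = 1;
    int err = 0;
    for (int i = 0; i < n; i++) {
        row[i] = sign;
        err += k;          /* advance along the "line" with slope k/n */
        if (err >= n) {    /* crossed an integer boundary...          */
            err -= n;
            sign = -sign;  /* ...so toggle the square wave            */
        }
    }
}
```

Row 0 comes out all +1 (the DC row), and higher k packs in more sign changes, roughly mirroring the sequency ordering of the true Walsh matrix.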
Jonathan
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.
Post Edited (lonesock) : 11/2/2009 11:18:21 PM GMT
Comments
1st version that works for me!
* I digitized audio from my headset mic at 8 bits and 8kHz and sent it over a serial link down to the board, and it worked! At least I could watch the frequency change as I whistled into the mic.
* Next, I attempted to modify the code so that it will use the microphone input from the demo board. I do _NOT_ actually have a demo board, so this is untested. Could someone please pretty pretty please test this for me? 8^)
2nd version:
* turns out you need to enable the output pin as an output 8^), should work much better
* changed the background colors so it's obvious if the VGA display is working
* I no longer display the 1st energy bucket (the DC component) as it could swamp the other data
* I now sample at 8192 Hz, instead of just randomly "whatever speed this happens to be"
3rd version:
* picked a name: "SPeechIN" (hey, I have a name, I must be halfway there! 8^)
* process the Pseudo-Walsh transformed data:
-- centroid of the 1st and 2nd formant bands (approximate), stored as 2 bytes
-- ratio of power in the F2 band to the total power (from DC up through the F2 band, attempting to catch when there is very little energy in F2 (as in /oo/)), stored as 1 byte
-- ratio of power above F2 to the total power (attempting to discern sibilance), also 1 byte
* display the sideways transform a bit smaller, and added a square to plot the F1,F2 centroid data point in real-time
4th version:
* Added a basic auto-threshold for the audio level
* Store the 4 values computed per block, for blocks above the threshold...REALLY NEEDS HYSTERESIS
* Once the level drops below the threshold, re-samples those values up or down to a vector of size 32...to be my "word signature"...REALLY NEEDS AVERAGING
* displays the current signature in the bottom right of the screen (really slows down the display)
* the sampled audio is also output on the headphone jack
5th version:
* Hysteresis on the threshold seems to be working
* Added some dotted lines on the "word signature" display (bottom right) to show boundaries and midline
* The spoken word should be >= ('=>' if you prefer 8^) 1/2 second
Thanks for your patience, everybody!
Jonathan
Post Edited (lonesock) : 11/2/2009 11:16:54 PM GMT
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Computers are microcontrolled.
Robots are microcontrolled.
I am microcontrolled.
But you can call me micro.
Want to experiment with the SX or just put together a cool project?
SX Spinning light display
Thanks, David
Perhaps there is a "Walsh functions for dummies"?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Style and grace : Nil point
-Mike
@David: Thanks! I think I'll try soldering up a copy of the ADC circuit (as used on the demo board) on a protoboard, knowing full well that I will get very few good bits 8^)
@ Toby: Thanks! If this all works out I'll probably do a little write-up on the pseudo-Walsh transform.
@Mike: Thanks for trying it...it's quite possible that your VGA is fine and my hacked code for reading the ADC is wrong. If you remove the line "rnd := phsa~", and instead uncomment the line "?rnd", hopefully you will see a white-noise type spectrum (sideways on the display). I should probably pick a different background color than black so it is obvious if the display is at least working.
@everybody: My plan is:
* get the PASM cog to also do the regular sampling of the ADC at a target frequency (probably somewhere in the 8-16kHz range)
* after N samples have been collected / distributed, take all the frequency information and sum the energy into (probably) 4 bands, targeting the approximate frequencies of the first 3 formant bands, and a high frequency band to catch sibilance. Use 1 byte per band, so the entire N-sample window is stored as a single 32-bit long.
* use a time & energy threshold to determine if I should be storing this sample
* use a simple state machine to see if we are in the middle of storing, or just finished (i.e. the energy has dropped below the threshold for a certain period of time)
* if we just finished, find a match...this is the hard part...I need to do some research as I have never implemented Dynamic Time Warping or Hidden Markov Models. Hopefully one of these or both can be done quickly in PASM and still fit in my remaining Cog code space.
Everything is pretty simple until that last magic bullet point. 8^) I hope to have a semi-decent & clean framework up in a week or two.
Jonathan
The alternative would be to double-buffer the input, and use two cogs. One cog would fill an input buffer while the other cog performs the transform. This way you could use a fast algorithm, which would only require 10,240 additions and subtractions for a block of 1024 samples.
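For reference, the fast algorithm Dave describes is the standard in-place fast Walsh-Hadamard transform; for n = 1024 that's n*log2(n) = 10,240 adds and subtracts, matching his count. A minimal C sketch (n must be a power of 2):

```c
/* In-place fast Walsh-Hadamard transform (Hadamard/natural order).
   Cost is n*log2(n) additions and subtractions; no multiplies.
   n must be a power of 2. */
void fwht(int *data, int n)
{
    for (int span = 1; span < n; span <<= 1) {
        for (int i = 0; i < n; i += span << 1) {
            for (int j = i; j < i + span; j++) {
                int a = data[j];
                int b = data[j + span];
                data[j]        = a + b;   /* butterfly: sum        */
                data[j + span] = a - b;   /* butterfly: difference */
            }
        }
    }
}
```

Applying it twice recovers the input scaled by n, which is the W*W = N*I property discussed later in the thread.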
Dave
-Mike
* audio (speech) input, probably a max of 16 kHz sample rate
* lowest frequency I care about is ~250 Hz, yielding a block size of ~64
* Hanno's speech recognition challenge specified only 1 cog, so I have quite a bit more to cram into the one I'm already using, and can't spawn a second
* as long as I'm not doing anything else with my compute cycles between each audio sample, I might as well build my solution as I go: even though the total operation cost is O(N^2), the transformed data will be ready O(N) after the last sample instead of O(N log N), which in terms of latency is still a win.
* at 80 MHz clock speed, and 16 kHz sample rate, I can still pull off about a block size of 200
* which reminds me, this "Pseudo-Walsh" transform can have any even block size; it is not limited to powers of 2 as the FW(H)T is. This lets me scale the block size as desired to match the Propeller's clock speed, etc.
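The "build my solution as I go" bullet can be sketched like this (illustrative C with a toy 4x4 Walsh matrix; the real code is PASM and uses the pseudo-Walsh rows). Each incoming sample is folded into every bucket immediately, so only O(N) work remains once the block's last sample arrives:

```c
#define BLOCK 4

/* 4x4 sequency-ordered Walsh matrix, entries +/-1 (toy example) */
static const int W4[BLOCK][BLOCK] = {
    { 1,  1,  1,  1 },
    { 1,  1, -1, -1 },
    { 1, -1, -1,  1 },
    { 1, -1,  1, -1 },
};

int bucket[BLOCK];   /* running transform accumulators          */
int col;             /* index of the current sample in the block */

/* Fold one sample into every bucket. Total work per block is still
   O(N*N), but it is spread across the sampling interval instead of
   happening all at once after capture. Since the entries are +/-1,
   each step is a single add or subtract on the Propeller. */
void accumulate_sample(int sample)
{
    for (int k = 0; k < BLOCK; k++)
        bucket[k] += sample * W4[k][col];
    if (++col == BLOCK)
        col = 0;   /* buckets now hold the full block transform */
}
```

After BLOCK calls, `bucket[]` equals the matrix-vector product of W4 with the sample block, exactly as if the whole transform had been run at the end.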
@Mike: OK, that means I'm grabbing the ADC values wrong (or possibly the DC component of the transform is swamping everything else...is there a single line drawn at the top of the display, per chance?)
Thanks for testing and for the comments, everybody!
Jonathan
-Mike
I'm not a C++ programmer, so I created a C version of your program to try it out. It is attached below for any other C programmers that may be interested in trying it.
Dave
Post Edited (Dave Hein) : 10/17/2009 7:55:34 PM GMT
I think you were asking Mike if there was one line drawn in the original program? I do not see one on my demo board.
[quote="lonesock"]@Mike: Thanks for trying it...it's quite possible that your VGA is fine and my hacked code for reading the ADC is wrong. If you remove the line "rnd := phsa~", and instead uncomment the line "?rnd", hopefully you will see a white-noise type spectrum (sideways on the display). I should probably pick a different background color than black so it is obvious if the display is at least working.[/quote]
I did verify that your change will produce the random horizontal lines on my display, as predicted.
Jim
Post Edited (hover1) : 10/17/2009 8:38:07 PM GMT
Jonathan
Post Edited (lonesock) : 10/17/2009 11:04:13 PM GMT
Although there is a checkerboard background. Did you mean to have that there?
Good stuff,
Jim
Jonathan
Jim
Here are the original challenge terms:
I would love to include simple speech recognition in 12Blocks - so, here's the challenge:
-Hardware is limited to the Parallax DemoBoard
-Uses 1 cog and less than 5KB global ram
-Uses 1 spin variable to indicate what word was recognized
-Must understand either: "1,2,3,4,5,6,7,8,9,10" or "up,down,left,right,yes,no". I should be able to take the code, speak the items in any order and not see a mistake. It's ok if I have to repeat myself, speak carefully, be in a quiet room...
-Code must be MIT license
Winner gets a ViewPort Ultimate license...
I believe RAM usage has been increased to 10KB...
Good luck!
Hanno
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Co-author of the official Propeller Guide- available at Amazon
Developer of ViewPort, the premier visual debugger for the Propeller (read the review here),
12Blocks, the block-based programming environment
and PropScope, the multi-function USB oscilloscope/function generator/logic analyzer
http://forums.parallax.com/forums/default.aspx?f=25&p=2&m=365583#m379462
Hanno
Jonathan
This means that consecutive values of WHT coefficients represent the sine and cosine of the same frequency. A spectrum could be computed by combining each pair of coefficients in the same way that the sines and cosines are combined to compute a magnitude -- that is, the square root of the sum of the squares. This could be approximated by summing the absolute values.
I haven't tried generating a spectrum from WHT coefficients, but when I get a chance I'll try it out.
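Dave's pairing idea, sketched in C (a hypothetical helper, not tested against real WHT output): the exact magnitude is sqrt(a*a + b*b), and the integer-friendly |a| + |b| approximation he suggests overestimates it by at most a factor of sqrt(2).

```c
#include <stdlib.h>   /* abs */

/* Combine consecutive WHT coefficient pairs (sine/cosine-like) into
   n/2 spectral magnitudes using the sum-of-absolute-values
   approximation: |a| + |b| instead of sqrt(a*a + b*b). */
void wht_spectrum(const int *coef, int n, int *mag)
{
    for (int i = 0; i < n / 2; i++)
        mag[i] = abs(coef[2 * i]) + abs(coef[2 * i + 1]);
}
```

On the Propeller this keeps the spectrum computation to an ABS and an ADD per coefficient, with no square root needed.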
Dave
Correct me if I'm wrong, but you're implementing this as a matrix multiply of the Pseudo-Walsh matrix times a vector of input samples? In that case wouldn't multiplying the resulting spectrum by the inverse of the Pseudo-Walsh matrix re-create the original data? Wonder what the inverse Pseudo-Walsh matrix would look like?
My two bits,
Lawson
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Lunch cures all problems! have you had lunch?
@Lawson: You are exactly right, you could just use the inverse of the matrix. It's just not a pretty integer-only type thing. One nice property of the Walsh matrix proper is that W*W = I * (size of W). So if you take the Walsh matrix of size 16, and multiply it by itself, you get the identity matrix * 16. This means that the inverse Walsh transform is exactly the same as the forward Walsh transform times a constant, which is extremely convenient. Using the Pseudo-Walsh transform twice does almost the same thing, except there are small off-diagonal terms, which translate into very obvious jumps in the reconstructed time series. If you want to go that route, it is probably best to actually use the fast Walsh transform directly, and not this thing I'm using 8^)
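The W*W = N*I property is easy to check numerically; here's a tiny C sketch using the 4x4 sequency-ordered Walsh matrix (which is symmetric, so the self-product is what the transform-twice experiment computes):

```c
/* Entry (i,j) of W*W for the 4x4 sequency-ordered Walsh matrix.
   For a true Walsh matrix this should be 4 (the size) on the
   diagonal and 0 everywhere else, i.e. W*W = 4*I. */
int walsh_self_product(int i, int j)
{
    static const int W[4][4] = {
        { 1,  1,  1,  1 },
        { 1,  1, -1, -1 },
        { 1, -1, -1,  1 },
        { 1, -1,  1, -1 },
    };
    int s = 0;
    for (int k = 0; k < 4; k++)
        s += W[i][k] * W[k][j];
    return s;
}
```

Running the same check with a pseudo-Walsh matrix would show the small off-diagonal terms Jonathan mentions, which is why transform-twice reconstruction is only approximate there.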
Jonathan
Here's the next update, check post #2 for the code. I do some processing on each sample block now, reducing it to 4 bytes, stored as a long. Next up is to collect a series of these to make a normalized "word signature" vector.
Oh yeah, I picked a name for this library: "SPeechIN" 8^)
Jonathan
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Links to other interesting threads:
* Home of the MultiBladeProps: TriBlade, RamBlade, SixBlade, website
* Single Board Computer: 3 Propeller ICs and a TriBladeProp board (ZiCog Z80 Emulator)
* Prop Tools under Development or Completed (Index)
* Emulators: CPUs Z80 etc; Micros Altair etc; Terminals VT100 etc; (Index) ZiCog (Z80), MoCog (6809)
* Search the Propeller forums (uses advanced Google search)
My cruising website is: www.bluemagic.biz  MultiBladeProp is: www.bluemagic.biz/cluso.htm
Jonathan
(Over your head...Ha! Maybe outside your area of expertise 8^)
1, 1/2, -1/2, -1, -1/2, 1/2
It turns out that exactly matches a six-point cosine...something like cos( N*1.0472 ) for N from 0..5, IIRC (1.0472 radians is pi/3, or 60 degrees). And of course the division by 2 is easy using SAR.
When I code this in PASM, it looks like it will take about 20 instructions per bucket, instead of the current 6. Which in turn means I can only do a maximum transform block size of about 60, using an 80 MHz prop and running 16 kHz sample rate. If I drop the sample rate to 8000, I can use a block size of 120. Also, now that the transform uses a 6-sample cosine function instead of a 2 sample cosine function, it is probably best to use a block multiple of 6 instead of 2, so 120 works out nicely.
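The per-sample weight pattern above can be made multiply-free, as a C sketch (this assumes `>>` does an arithmetic shift on signed ints, which is what SAR gives you in PASM and what essentially all C compilers do):

```c
/* One bucket's contribution for one sample, using the 6-point cosine
   weights 1, 1/2, -1/2, -1, -1/2, 1/2 = cos(n*pi/3) for n = 0..5.
   The halves cost only an arithmetic right shift (SAR), so the inner
   loop still needs no multiplies. phase must be in 0..5. */
int cos6_term(int sample, int phase)
{
    switch (phase) {
    case 0:  return sample;          /*  1                       */
    case 1:  return sample >> 1;     /*  1/2 (arithmetic shift)  */
    case 2:  return -(sample >> 1);  /* -1/2                     */
    case 3:  return -sample;         /* -1                       */
    case 4:  return -(sample >> 1);  /* -1/2                     */
    default: return sample >> 1;     /*  1/2                     */
    }
}
```

Each bucket steps `phase` at its own rate, which is why a block size that's a multiple of 6 works out nicely.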
This would be more of a Pseudo-DCT instead of Pseudo-Walsh, I think. The Pseudo-Walsh actually had decent energy compaction properties: for a sine "chirp" there was always a peak at the right bucket, the rest of the energy did show up in scattered buckets (I don't say "random", as there was definitely a pattern), and the remaining energy peaks were almost always < 25% of the maximum peak's value. I'm hoping that by going to the 6-point cosine approximation I get better energy compaction, especially for single frequencies.
I think, though, that the original method may indeed be good enough, so this may all be academic. It was just a neat(-ish) concept, and I wanted to share it in case anyone else wanted to do something like a real-time frequency transform for another application. The PASM code is pretty cool too, so I'll post when I get the next rev up and running, secretly hoping for someone to optimize it further so I can use a larger block size at 16 kHz 8^)
thanks,
Jonathan
If anybody sees a way to speed this up I would be grateful!
Jonathan
Basically I've started work on getting a signature per word. There are 2 things it still really needs: hysteresis on the thresholding, and averaging/interpolation when resampling the "word signature" to a constant length. Oh, and this version uses the 6-point cosine approximation instead of the 2-point approximation in use previously. It cuts down on aliasing a bit. If we need to either sample faster than 8 kHz or run on a system with a slower clock, I'd definitely go back to the 2-point.
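The signature-resampling step might look like the following C sketch (nearest-neighbor only, matching the "still needs averaging" caveat above; the fixed length of 32 is from the 4th-version notes):

```c
#define SIG_LEN 32   /* fixed "word signature" length */

/* Resample a variable-length run of per-block feature values (one
   byte each) up or down to exactly SIG_LEN entries, so utterances of
   different lengths can be compared. Nearest-neighbor: when n is
   larger than SIG_LEN this drops samples instead of averaging them,
   which is the part that still needs improvement. */
void resample_signature(const unsigned char *in, int n, unsigned char *out)
{
    for (int i = 0; i < SIG_LEN; i++)
        out[i] = in[(i * n) / SIG_LEN];   /* nearest source index */
}
```

The integer index math `(i * n) / SIG_LEN` avoids any division per output value beyond the one shown, and could be turned into another Bresenham-style accumulator in PASM.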
Once the "word signature" is working well, then I can start in on the comparison against pre-stored templates to find the best word match. I'm sure lot's of things will still need to be tuned and tweaked [noparse][[/noparse]8^)
thanks,
Jonathan
The latest version of the code does a decent job of detecting word boundaries and generating a word signature (shown in the bottom right). There are no word templates to compare against yet, but please test the current form and give any feedback or observations. Here's the procedure.
* run the code on the demo board, with a VGA monitor connected (and optional headphones)
* say a word (isolated, > 0.5 and < 4 seconds in length)
* look at the bottom right corner of the screen, showing four traces, which represent (from left to right):
-- 1 - The F1 centroid, showing a dominant frequency in the first formant band (250-950 Hz)
-- 2 - The F2 centroid, showing a dominant frequency in the second formant band (950-2500 Hz)
-- 3 - The ratio of energy in the F2 band to the total energy in F1 and F2. (so centered means 1/2 and 1/2, to the right means F2 has more energy, left means F1 has more energy)
-- 4 - The ratio of the energy above F2 to the total energy in and above F1 (this is to approximate sibilance...look for higher values when you use "s" and "th", etc.)
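The four traces above can be sketched as small C helpers (the bucket ranges and byte scaling here are illustrative, not the exact values SPeechIN uses): a centroid is the energy-weighted mean bucket index within a band, and the ratio traces compare one band's energy against a wider total, scaled into a byte.

```c
/* Energy-weighted centroid of transform buckets [lo, hi): a dominant
   frequency within a formant band, e.g. F1 or F2. */
int band_centroid(const int *energy, int lo, int hi)
{
    int sum = 0, wsum = 0;
    for (int i = lo; i < hi; i++) {
        sum  += energy[i];
        wsum += energy[i] * i;
    }
    return sum ? wsum / sum : lo;   /* fall back to lo if band is silent */
}

/* Ratio of the energy in buckets [lo, hi) to the total energy in
   buckets [0, hi), scaled to 0..255 so it fits in one byte. */
int band_ratio(const int *energy, int lo, int hi)
{
    int band = 0, total = 0;
    for (int i = 0; i < hi; i++) {
        if (i >= lo)
            band += energy[i];
        total += energy[i];
    }
    return total ? (band * 255) / total : 0;
}
```

With one centroid each for the F1 and F2 bands and two ratios, a whole sample block reduces to the 4 bytes (one long) described in the 3rd-version notes.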
So, any feedback? Does the word detection work for you? Too sensitive, or not sensitive enough? Do the resulting "word signature" squiggles make sense, and look repeatable for a given word?
thanks,
Jonathan