SPeechIN: (WiP) using the Poor Man's DCT - need Testers!
lonesock
Posts: 917
Hi, All.
I've been intrigued by the challenge of getting simple speech recognition working on the prop (especially fitting it into Hanno's additional restrictions). I had an idea about this, hence this post 8^)
Here are the basic concepts:
* The Walsh Transform converts a time series to the frequency domain (sort of), using only additions and subtractions (The Poor Man's DCT)
* The Walsh Transform Matrix is moderately difficult to generate on the fly
* Here's the new idea: you can approximate the Walsh Transform Matrix by using Bresenham's Line Algorithm, and I'm calling this the Pseudo-Walsh Transform. I haven't seen any literature on this, so as far as I know it's new 8^)
* The transform matrix is symmetric, so instead of waiting to capture N samples then performing an O(N*N) transform, you can accumulate the contribution to each "bucket" one sample at a time.
That's the basic idea. I'll be working on it as time permits, and provide some source code and such when I'm done. If anyone has any comments or suggestions, I would very much welcome them. Attached is some simple C++ code to generate a bunch of these transform matrices, so I could test them in Octave (or MATLAB if you have it).
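For anyone curious what the Bresenham trick might look like, here's a minimal C sketch (my own reconstruction, not the attached generator code): row k of an N x N pseudo-Walsh matrix is a +/-1 square wave with k sign flips, spread as evenly as integer math allows by the usual Bresenham error accumulator.

```c
/* Sketch (reconstruction, not the attached generator): build row k of
   an N x N pseudo-Walsh matrix. The error accumulator is exactly
   Bresenham's: advance with slope k/n, toggle the sign on overflow,
   using only integer adds and compares. */
void pseudo_walsh_row(int k, int n, int *row)
{
    int sign = 1;
    int err = 0;
    for (int i = 0; i < n; i++) {
        row[i] = sign;
        err += k;          /* advance along the "line" with slope k/n */
        if (err >= n) {    /* crossed an integer boundary...          */
            err -= n;
            sign = -sign;  /* ...so toggle the square wave            */
        }
    }
}
```

Row 0 comes out all +1 (the DC row), and higher k packs in more sign changes, roughly mirroring the sequency ordering of the true Walsh matrix.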
Jonathan
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.
Post Edited (lonesock) : 11/2/2009 11:18:21 PM GMT
Comments
1st version that works for me!
* I digitized audio from my headset mic at 8 bits and 8kHz and sent it over a serial link down to the board, and it worked! At least I could watch the frequency change as I whistled into the mic.
* Next, I attempted to modify the code so that it will use the microphone input from the demo board. I do _NOT_ actually have a demo board, so this is untested. Could someone please pretty pretty please test this for me? 8^)
2nd version:
* turns out you need to enable the output pin as an output 8^), should work much better
* changed the background colors so it's obvious if the VGA display is working
* I no longer display the 1st energy bucket (the DC component) as it could swamp the other data
* I now sample at 8192 Hz, instead of just randomly "whatever speed this happens to be"
3rd version:
* picked a name: "SPeechIN" (hey, I have a name, I must be halfway there! 8^)
* process the Pseudo-Walsh transformed data:
-- centroid of the 1st and 2nd formant bands (approximate), stored as 2 bytes
-- ratio of power in the F2 band to the total power (from DC up through the F2 band, attempting to catch when there is very little energy in F2 (as in /oo/)), stored as 1 byte
-- ratio of power above F2 to the total power (attempting to discern sibilance), also 1 byte
* display the sideways transform a bit smaller, and added a square to plot the F1,F2 centroid data point in real-time
4th version:
* Added a basic auto-threshold for the audio level
* Store the 4 values computed per block, for blocks above the threshold...REALLY NEEDS HYSTERESIS
* Once the level drops below the threshold, re-samples those values up or down to a vector of size 32...to be my "word signature"...REALLY NEEDS AVERAGING
* displays the current signature in the bottom right of the screen (really slows down the display)
* the sampled audio is also output on the headphone jack
5th version:
* Hysteresis on the threshold seems to be working
* Added some dotted lines on the "word signature" display (bottom right) to show boundaries and midline
* The spoken word should be >= ('=>' if you prefer 8^) 1/2 second
Thanks for your patience, everybody!
Jonathan
Post Edited (lonesock) : 11/2/2009 11:16:54 PM GMT
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Computers are microcontrolled.
Robots are microcontrolled.
I am microcontrolled.
But you can call me micro.
Want to experiment with the SX or just put together a cool project?
SX Spinning light display
Thanks, David
Perhaps there is a "Walsh functions for dummies"?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Style and grace : Nil point
-Mike
@David: Thanks! I think I'll try soldering up a copy of the ADC circuit (as used on the demo board) on a protoboard, knowing full well that I will get very few good bits 8^)
@ Toby: Thanks! If this all works out I'll probably do a little write-up on the pseudo-Walsh transform.
@Mike: Thanks for trying it...it's quite possible that your VGA is fine and my hacked code for reading the ADC is wrong. If you remove the line "rnd := phsa~", and instead uncomment the line "?rnd", hopefully you will see a white-noise type spectrum (sideways on the display). I should probably pick a different background color than black so it is obvious if the display is at least working.
@everybody: My plan is:
* get the PASM cog to also do the regular sampling of the ADC at a target frequency (probably somewhere in the 8-16kHz range)
* after N samples have been collected / distributed, take all the frequency information and sum the energy into (probably) 4 bands, targeting the approximate frequencies of the first 3 formant bands, and a high frequency band to catch sibilance. Use 1 byte per band, so the entire N-sample window is stored as a single 32-bit long.
* use a time & energy threshold to determine if I should be storing this sample
* use a simple state machine to see if we are in the middle of storing, or just finished (i.e. the energy has dropped below the threshold for a certain period of time)
* if we just finished, find a match...this is the hard part...I need to do some research as I have never implemented Dynamic Time Warping or Hidden Markov Models. Hopefully one of these or both can be done quickly in PASM and still fit in my remaining Cog code space.
Everything is pretty simple until that last magic bullet point. 8^) I hope to have a semi-decent & clean framework up in a week or two.
Jonathan
The alternative would be to double-buffer the input, and use two cogs. One cog would fill an input buffer while the other cog performs the transform. This way you could use a fast algorithm, which would only require 10,240 additions and subtractions for a block of 1024 samples.
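For reference, the fast algorithm Dave describes is the standard in-place fast Walsh-Hadamard transform; for n = 1024 that's n*log2(n) = 10,240 adds and subtracts, matching his count. A minimal C sketch (n must be a power of 2):

```c
/* In-place fast Walsh-Hadamard transform (Hadamard/natural order).
   Cost is n*log2(n) additions and subtractions; no multiplies.
   n must be a power of 2. */
void fwht(int *data, int n)
{
    for (int span = 1; span < n; span <<= 1) {
        for (int i = 0; i < n; i += span << 1) {
            for (int j = i; j < i + span; j++) {
                int a = data[j];
                int b = data[j + span];
                data[j]        = a + b;   /* butterfly: sum        */
                data[j + span] = a - b;   /* butterfly: difference */
            }
        }
    }
}
```

Applying it twice recovers the input scaled by n, which is the W*W = N*I property discussed later in the thread.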
Dave
-Mike
* audio (speech) input, probably a max of 16 kHz sample rate
* lowest frequency I care about is ~250 Hz, yielding a block size of ~64
* Hanno's speech recognition challenge specified only 1 cog, so I have quite a bit more to cram into the one I'm already using, and can't spawn a second
* as long as I'm not doing anything else with my compute cycles between each audio sample, I might as well build my solution as I go: even though the total operation cost is O(N^2), the transformed data will be ready O(N) after the last sample instead of O(N log N), which in terms of latency is still a win.
* at 80 MHz clock speed, and 16 kHz sample rate, I can still pull off about a block size of 200
* which reminds me, this "Pseudo-Walsh" transform can have any even block size; it is not limited to powers of 2 as the FW(H)T is. This lets me scale the block size as desired to match the Propeller's clock speed, etc.
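The "build my solution as I go" bullet can be sketched like this (illustrative C with a toy 4x4 Walsh matrix; the real code is PASM and uses the pseudo-Walsh rows). Each incoming sample is folded into every bucket immediately, so only O(N) work remains once the block's last sample arrives:

```c
#define BLOCK 4

/* 4x4 sequency-ordered Walsh matrix, entries +/-1 (toy example) */
static const int W4[BLOCK][BLOCK] = {
    { 1,  1,  1,  1 },
    { 1,  1, -1, -1 },
    { 1, -1, -1,  1 },
    { 1, -1,  1, -1 },
};

int bucket[BLOCK];   /* running transform accumulators          */
int col;             /* index of the current sample in the block */

/* Fold one sample into every bucket. Total work per block is still
   O(N*N), but it is spread across the sampling interval instead of
   happening all at once after capture. Since the entries are +/-1,
   each step is a single add or subtract on the Propeller. */
void accumulate_sample(int sample)
{
    for (int k = 0; k < BLOCK; k++)
        bucket[k] += sample * W4[k][col];
    if (++col == BLOCK)
        col = 0;   /* buckets now hold the full block transform */
}
```

After BLOCK calls, `bucket[]` equals the matrix-vector product of W4 with the sample block, exactly as if the whole transform had been run at the end.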
@Mike: OK, that means I'm grabbing the ADC values wrong (or possibly the DC component of the transform is swamping everything else...is there a single line drawn at the top of the display, per chance?)
Thanks for testing and for the comments, everybody!
Jonathan
-Mike
I'm not a C++ programmer, so I created a C version of your program to try it out. It is attached below for any other C programmers that may be interested in trying it.
Dave
Post Edited (Dave Hein) : 10/17/2009 7:55:34 PM GMT
I think you were asking Mike if there was one line drawn in the original program? I do not see one on my demo board.
[quote="lonesock"]@Mike: Thanks for trying it...it's quite possible that your VGA is fine and my hacked code for reading the ADC is wrong. If you remove the line "rnd := phsa~", and instead uncomment the line "?rnd", hopefully you will see a white-noise type spectrum (sideways on the display). I should probably pick a different background color than black so it is obvious if the display is at least working.[/quote]
I did verify that your change will produce the random horizontal lines on my display, as predicted.
Jim
Post Edited (hover1) : 10/17/2009 8:38:07 PM GMT
Jonathan
Post Edited (lonesock) : 10/17/2009 11:04:13 PM GMT
Although there is a checkerboard background. Did you mean to have that there?
Good stuff,
Jim
Jonathan
Jim
Here are the original challenge terms:
I would love to include simple speech recognition in 12Blocks - so, here's the challenge:
-Hardware is limited to the Parallax DemoBoard
-Uses 1 cog and less than 5KB global ram
-Uses 1 spin variable to indicate what word was recognized
-Must understand either: "1,2,3,4,5,6,7,8,9,10" or "up,down,left,right,yes,no". I should be able to take the code, speak the items in any order and not see a mistake. It's ok if I have to repeat myself, speak carefully, be in a quiet room...
-Code must be MIT license
Winner gets a ViewPort Ultimate license...
I believe RAM usage has been increased to 10KB...
Good luck!
Hanno
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Co-author of the official Propeller Guide- available at Amazon
Developer of ViewPort, the premier visual debugger for the Propeller (read the review here),
12Blocks, the block-based programming environment
and PropScope, the multi-function USB oscilloscope/function generator/logic analyzer
http://forums.parallax.com/forums/default.aspx?f=25&p=2&m=365583#m379462
Hanno
Jonathan
This means that consecutive values of WHT coefficients represent the sine and cosine of the same frequency. A spectrum could be computed by combining each pair of coefficients in the same way that the sines and cosines are combined to compute a magnitude -- that is, the square root of the sum of the squares. This could be approximated by summing the absolute values.
I haven't tried generating a spectrum from WHT coefficients, but when I get a chance I'll try it out.
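Dave's pairing idea, sketched in C (a hypothetical helper, not tested against real WHT output): the exact magnitude is sqrt(a*a + b*b), and the integer-friendly |a| + |b| approximation he suggests overestimates it by at most a factor of sqrt(2).

```c
#include <stdlib.h>   /* abs */

/* Combine consecutive WHT coefficient pairs (sine/cosine-like) into
   n/2 spectral magnitudes using the sum-of-absolute-values
   approximation: |a| + |b| instead of sqrt(a*a + b*b). */
void wht_spectrum(const int *coef, int n, int *mag)
{
    for (int i = 0; i < n / 2; i++)
        mag[i] = abs(coef[2 * i]) + abs(coef[2 * i + 1]);
}
```

On the Propeller this keeps the spectrum computation to an ABS and an ADD per coefficient, with no square root needed.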
Dave
Correct me if I'm wrong, but you're implementing this as a matrix multiply of the Pseudo-Walsh matrix times a vector of input samples? In that case wouldn't multiplying the resulting spectrum by the inverse of the Pseudo-Walsh matrix re-create the original data? Wonder what the inverse Pseudo-Walsh matrix would look like?
My two bits,
Lawson
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Lunch cures all problems! have you had lunch?
@Lawson: You are exactly right, you could just use the inverse of the matrix. It's just not a pretty integer-only type thing. One nice property of the Walsh matrix proper is that W*W = I * (size of W). So if you take the Walsh matrix of size 16, and multiply it by itself, you get the identity matrix * 16. This means that the inverse Walsh transform is exactly the same as the forward Walsh transform times a constant, which is extremely convenient. Using the Pseudo-Walsh transform twice does almost the same thing, except there are small off-diagonal terms, which translate into very obvious jumps in the reconstructed time series. If you want to go that route, it is probably best to actually use the fast Walsh transform directly, and not this thing I'm using 8^)
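The W*W = N*I property is easy to check numerically; here's a tiny C sketch using the 4x4 sequency-ordered Walsh matrix (which is symmetric, so the self-product is what the transform-twice experiment computes):

```c
/* Entry (i,j) of W*W for the 4x4 sequency-ordered Walsh matrix.
   For a true Walsh matrix this should be 4 (the size) on the
   diagonal and 0 everywhere else, i.e. W*W = 4*I. */
int walsh_self_product(int i, int j)
{
    static const int W[4][4] = {
        { 1,  1,  1,  1 },
        { 1,  1, -1, -1 },
        { 1, -1, -1,  1 },
        { 1, -1,  1, -1 },
    };
    int s = 0;
    for (int k = 0; k < 4; k++)
        s += W[i][k] * W[k][j];
    return s;
}
```

Running the same check with a pseudo-Walsh matrix would show the small off-diagonal terms Jonathan mentions, which is why transform-twice reconstruction is only approximate there.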
Jonathan
Here's the next update, check post #2 for the code. I do some processing on each sample block now, reducing it to 4 bytes, stored as a long. Next up is to collect a series of these to make a normalized "word signature" vector.
Oh yeah, I picked a name for this library: "SPeechIN" 8^)
Jonathan
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Links to other interesting threads:
* Home of the MultiBladeProps: TriBlade, RamBlade, SixBlade, website
* Single Board Computer: 3 Propeller ICs and a TriBladeProp board (ZiCog Z80 Emulator)
* Prop Tools under Development or Completed (Index)
* Emulators: CPUs Z80 etc; Micros Altair etc; Terminals VT100 etc; (Index) ZiCog (Z80), MoCog (6809)
* Search the Propeller forums (uses advanced Google search)
My cruising website is: www.bluemagic.biz  MultiBladeProp is: www.bluemagic.biz/cluso.htm
Jonathan
(Over your head...Ha! Maybe outside your area of expertise 8^)
1, 1/2, -1/2, -1, -1/2, 1/2
It turns out that exactly matches a six-point cosine...something like cos( N*1.0472 ) for N from 0..5, IIRC (1.0472 radians is pi/3, or 60 degrees). And of course the division by 2 is easy using SAR.
When I code this in PASM, it looks like it will take about 20 instructions per bucket, instead of the current 6. Which in turn means I can only do a maximum transform block size of about 60, using an 80 MHz prop and running 16 kHz sample rate. If I drop the sample rate to 8000, I can use a block size of 120. Also, now that the transform uses a 6-sample cosine function instead of a 2 sample cosine function, it is probably best to use a block multiple of 6 instead of 2, so 120 works out nicely.
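The per-sample weight pattern above can be made multiply-free, as a C sketch (this assumes `>>` does an arithmetic shift on signed ints, which is what SAR gives you in PASM and what essentially all C compilers do):

```c
/* One bucket's contribution for one sample, using the 6-point cosine
   weights 1, 1/2, -1/2, -1, -1/2, 1/2 = cos(n*pi/3) for n = 0..5.
   The halves cost only an arithmetic right shift (SAR), so the inner
   loop still needs no multiplies. phase must be in 0..5. */
int cos6_term(int sample, int phase)
{
    switch (phase) {
    case 0:  return sample;          /*  1                       */
    case 1:  return sample >> 1;     /*  1/2 (arithmetic shift)  */
    case 2:  return -(sample >> 1);  /* -1/2                     */
    case 3:  return -sample;         /* -1                       */
    case 4:  return -(sample >> 1);  /* -1/2                     */
    default: return sample >> 1;     /*  1/2                     */
    }
}
```

Each bucket steps `phase` at its own rate, which is why a block size that's a multiple of 6 works out nicely.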
This would be more of a Pseudo-DCT instead of Pseudo-Walsh, I think. The Pseudo-Walsh actually had decent energy compaction properties: for a sine "chirp" there was always a peak at the right bucket, the rest of the energy did show up in scattered buckets (I don't say "random", as there was definitely a pattern), and the remaining energy peaks were almost always < 25% of the maximum peak's value. I'm hoping that by going to the 6-point cosine approximation I get better energy compaction, especially for single frequencies.
I think, though, that the original method may indeed be good enough, so this may all be academic. It was just a neat(-ish) concept, and I wanted to share it in case anyone else wanted to do something like a real-time frequency transform for another application. The PASM code is pretty cool too, so I'll post when I get the next rev up and running, secretly hoping for someone to optimize it further so I can use a larger block size at 16 kHz 8^)
thanks,
Jonathan
If anybody sees a way to speed this up I would be grateful!
Jonathan
Basically I've started work on getting a signature per word. There are 2 things it still really needs: hysteresis on the thresholding, and averaging/interpolation when resampling the "word signature" to a constant length. Oh, and this version uses the 6-point cosine approximation instead of the 2-point approximation in use previously. It cuts down on aliasing a bit. If we need to either sample faster than 8 kHz or run on a system with a slower clock, I'd definitely go back to the 2-point.
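The signature-resampling step might look like the following C sketch (nearest-neighbor only, matching the "still needs averaging" caveat above; the fixed length of 32 is from the 4th-version notes):

```c
#define SIG_LEN 32   /* fixed "word signature" length */

/* Resample a variable-length run of per-block feature values (one
   byte each) up or down to exactly SIG_LEN entries, so utterances of
   different lengths can be compared. Nearest-neighbor: when n is
   larger than SIG_LEN this drops samples instead of averaging them,
   which is the part that still needs improvement. */
void resample_signature(const unsigned char *in, int n, unsigned char *out)
{
    for (int i = 0; i < SIG_LEN; i++)
        out[i] = in[(i * n) / SIG_LEN];   /* nearest source index */
}
```

The integer index math `(i * n) / SIG_LEN` avoids any division per output value beyond the one shown, and could be turned into another Bresenham-style accumulator in PASM.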
Once the "word signature" is working well, then I can start in on the comparison against pre-stored templates to find the best word match. I'm sure lot's of things will still need to be tuned and tweaked [noparse][[/noparse]8^)
thanks,
Jonathan
The latest version of the code does a decent job of detecting word boundaries and generating a word signature (shown in the bottom right). There are no word templates to compare against yet, but please test the current form and give any feedback or observations. Here's the procedure.
* run the code on the demo board, with a VGA monitor connected (and optional headphones)
* say a word (isolated, > 0.5 and < 4 seconds in length)
* look at the bottom right corner of the screen, showing four traces, which represent (from left to right):
-- 1 - The F1 centroid, showing a dominant frequency in the first formant band (250-950 Hz)
-- 2 - The F2 centroid, showing a dominant frequency in the second formant band (950-2500 Hz)
-- 3 - The ratio of energy in the F2 band to the total energy in F1 and F2. (so centered means 1/2 and 1/2, to the right means F2 has more energy, left means F1 has more energy)
-- 4 - The ratio of the energy above F2 to the total energy in and above F1 (this is to approximate sibilance...look for higher values when you use "s" and "th", etc.)
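The four traces above can be sketched as small C helpers (the bucket ranges and byte scaling here are illustrative, not the exact values SPeechIN uses): a centroid is the energy-weighted mean bucket index within a band, and the ratio traces compare one band's energy against a wider total, scaled into a byte.

```c
/* Energy-weighted centroid of transform buckets [lo, hi): a dominant
   frequency within a formant band, e.g. F1 or F2. */
int band_centroid(const int *energy, int lo, int hi)
{
    int sum = 0, wsum = 0;
    for (int i = lo; i < hi; i++) {
        sum  += energy[i];
        wsum += energy[i] * i;
    }
    return sum ? wsum / sum : lo;   /* fall back to lo if band is silent */
}

/* Ratio of the energy in buckets [lo, hi) to the total energy in
   buckets [0, hi), scaled to 0..255 so it fits in one byte. */
int band_ratio(const int *energy, int lo, int hi)
{
    int band = 0, total = 0;
    for (int i = 0; i < hi; i++) {
        if (i >= lo)
            band += energy[i];
        total += energy[i];
    }
    return total ? (band * 255) / total : 0;
}
```

With one centroid each for the F1 and F2 bands and two ratios, a whole sample block reduces to the 4 bytes (one long) described in the 3rd-version notes.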
So, any feedback? Does the word detection work for you? Too sensitive, or not sensitive enough? Do the resulting "word signature" squiggles make sense, and look repeatable for a given word?
thanks,
Jonathan