Speech Recognition

faisal · 2008-01-27 09:01

Hi all,

How would I go about making my Stamp recognise what I say?

Any help would be appreciated.

Regards

Faisal

KMoffett · 2008-01-27 14:25

Speak binary!

Sorry Faisal, I just couldn't resist.

Ken

Mike Green · 2008-01-27 15:51

Speech recognition is very complex and requires a lot of high speed analysis and the Parallax Stamps are just not suitable for that sort of processing. You would need to use a speech recognition processor and just use the Stamp to control it. One company that makes such a system is www.sensoryinc.com/products/vr_stamp_toolkits.html. Some of the "old" personal computers like the Apple II had simple speech input and pattern analysis programs that would take audio input, normalize it, and count zero crossings to get a measure of the dominant frequency. They would look for silences (to mark the beginning and end of words), then look in a previously processed dictionary to find the best match to the input. These programs would give fair accuracy for a vocabulary of a few words, maybe up to 10-15 words as long as the words were distinctive. For practical use, they were useless, but were great for demonstrations and science fair projects.
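For anyone curious what "count zero crossings" means in practice, here is a minimal sketch in Python (not Stamp code, and not taken from any of those old programs). It assumes the audio is already available as a list of signed samples; the function name, the 8 kHz sample rate and the 440 Hz test tone are made up for illustration.

```python
import math

def zero_crossing_frequency(samples, sample_rate):
    """Estimate the dominant frequency (Hz) from the number of sign changes."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:])
        if (a < 0) != (b < 0)
    )
    duration = len(samples) / sample_rate
    # One full cycle of a tone crosses zero twice.
    return crossings / (2 * duration)

# Synthetic test input: one second of a 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate + 0.1)
        for n in range(sample_rate)]
estimate = zero_crossing_frequency(tone, sample_rate)
print(round(estimate))
```

On a clean tone the estimate lands within a hertz or so of the real pitch. Real speech is much messier, which is part of why the vocabulary had to stay small.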

Zoot · 2008-01-27 23:16

KMoffett -- it's not a joke. I am right now, while contemplating these very ideas, looking at a circuit and some code structure from an old robotics book that is designed to let the 'bot "hear" binary.

The author posits the idea that for simpler 'bots it may be much much much easier for you to learn to speak rudimentary binary than for the 'bot to understand English (his analogy is that you wouldn't use full, unadjusted language to speak to a small child, why would you expect your even simpler 'bot to be any different).

His approach is as follows:

- the speaker (you) speaks an arbitrary tone of some loose duration (say anywhere from 1 to 4 seconds). The rough pitch of the tone is measured and set as the "baseline" (in software ONLY for the current run)

- the 'bot will then expect 4 subsequent "tones", again the durations and spaces are *very* loose. Any tone that is HIGHER in pitch than the baseline is considered binary "1"; any tone lower in pitch than the baseline is binary "0". The 'bot then has your Nibble recognized and decoded. In a sense it's an alternate form of auto-baud-detect serial communication, where the first "start" bit is used to determine the "settings" for the 4 bits to follow.

The idea here is that no matter what the speaker's cadences and pitch (a child has a much higher pitched voice) the 'bot will adjust and decode the Nibble.

I imagine it would be quite funny to hear in person -- you'd be saying things to your 'bot like "hooooom. ummmm. ahhhh. ahhhh. ummmm" (%0110).
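The decoding step described above can be sketched in a few lines of Python. This is only my reading of the scheme, not DaCosta's implementation: the function name is hypothetical, the first tone is assumed to be the most significant bit, and a tone exactly at the baseline is arbitrarily treated as "0" because the scheme does not say what to do with it.

```python
def decode_nibble(baseline_hz, tone_pitches_hz):
    """Decode four tones into a 4-bit value relative to a baseline pitch."""
    if len(tone_pitches_hz) != 4:
        raise ValueError("expected exactly four tones after the baseline")
    value = 0
    for pitch in tone_pitches_hz:
        # Higher than the baseline is 1, otherwise 0 (tie -> 0 is an assumption).
        bit = 1 if pitch > baseline_hz else 0
        # Assumption: the first tone spoken is the most significant bit.
        value = (value << 1) | bit
    return value

# "hooooom. ummmm. ahhhh. ahhhh. ummmm" from a low-pitched speaker...
adult = decode_nibble(200, [150, 260, 250, 160])
# ...and the same pattern from a higher-pitched (child's) voice.
child = decode_nibble(400, [330, 480, 470, 340])
print(bin(adult), bin(child))
```

Both calls give %0110, which is the point: only the relationship to the baseline matters, not the absolute pitch.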

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
When the going gets weird, the weird turn pro. -- HST

1uffakind.com/robots/povBitMapBuilder.php
1uffakind.com/robots/resistorLadder.php

Zoot · 2008-01-27 23:24

Heh -- he even has an overly cute name for it "Fredian Grammar" -- for FREquency DIfferential ANalysis.

The book is an old Tab Book -- #1141 -- "How to Build Your Own Working Robot Pet", by Frank DaCosta. Circa 1979.

Beau Schwabe · 2008-01-28 02:30

Zoot,

Interesting, so the speaker (you) would sound like the person on the other end of the phone in every Charlie Brown Episode?

Seriously, that's a neat idea. The amount of recognition that you want to do strictly depends on the vocabulary that you want to implement, as well as your memory limitations and processing power. When I was a kid I wrote a program with an ATARI computer that could distinguish the difference between "YES" and "NO", using the PADDLE controller as a means for the audio input. It was based on something very similar to what you are describing.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Beau Schwabe

IC Layout Engineer
Parallax, Inc.

Zoot · 2008-01-28 02:47

Somebody said...
would sound like the person on the other end of the phone in every Charlie Brown Episode?

Yes! Exactly! It's so simple and clever, really. And DaCosta implemented his idea nearly 30 years ago. He uses an op-amp and measures the zero-crossings to get a rough pitch. I read the whole chapter and he discards pitches below 160 Hz and above 1250 Hz (but he does his counting and frequency measurements with something like 10 DIP sockets' worth of latches, shift registers and 555 timers -- all of that could be done in firmware). His "ideal" length for the reference and bit pitches is ~1 second, with a 1-4 second pause between each. That gives about a 20 second window for receiving the nibble -- if all the bits aren't received within 20 seconds or so, the input is discarded as a bogus transmission.

The basic circuit is a crystal mic into a two-transistor buffer which feeds an op-amp for detecting the zero-crossing of the frequency.

He chose Nibbles because he says 4 binary "digits" are pretty easy to remember -- going to 6 or 8 bits made it nearly impossible for him to "speak" without a chart. I would tend to agree -- 5-8 bits and you might need a cheat sheet. But given the non-limitations on the firmware, yeah, you could make the "vocabulary" as extensive as your brain could handle.

Beau Schwabe · 2008-01-28 03:42

Zoot,

Even if you build a basic one word command vocabulary and just have a few words, there are certain recognizable "patterns" produced depending on the choice of words used. Obviously there will be several words that might have similar patterns that you will want to avoid. Instead of focusing on the actual frequency, focus on the change in frequency (set a threshold and interpret this as a HIGH or LOW)... sort of like FSK. Also combine this with amplitude patterns. Using just those two speech components might surprise you.
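One possible reading of that idea, sketched in Python. This is an interpretation and not a tested recognizer: the function names, the frame values and both thresholds are invented for illustration. Each frame-to-frame change that exceeds a threshold becomes an "H" (rise) or "L" (fall), smaller changes are ignored, and the frequency pattern and amplitude pattern together form a signature that could be compared against stored words.

```python
def change_pattern(values, threshold):
    """Turn a series into a string of 'H' (rise) and 'L' (fall) symbols.

    Changes smaller than the threshold are treated as noise and skipped.
    """
    pattern = []
    for prev, cur in zip(values, values[1:]):
        delta = cur - prev
        if delta > threshold:
            pattern.append("H")
        elif delta < -threshold:
            pattern.append("L")
    return "".join(pattern)

def word_signature(freqs_hz, amplitudes, freq_threshold=30, amp_threshold=0.2):
    """Combine the frequency-change and amplitude-change patterns."""
    return (change_pattern(freqs_hz, freq_threshold),
            change_pattern(amplitudes, amp_threshold))

# Made-up per-frame measurements for a single spoken word.
signature = word_signature([200, 205, 300, 310, 180], [0.1, 0.8, 0.75, 0.2])
print(signature)
```

Words whose signatures come out the same are the ones to leave out of the vocabulary.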

Post Edited (Beau Schwabe (Parallax)) : 1/28/2008 3:52:14 AM GMT

Zoot · 2008-01-28 05:46

Beau -- I think we're both saying the same thing in a different way -- DaCosta's circuit measures the change -- the first 1 second "tone" received is the threshold. He samples over 2 ms and uses that as the "frequency" -- his circuit outputs a TTL square wave from the op-amp based on whatever is being received by the mic. Then in software he just uses the count -- if he gets X pulses generated from the original sinusoidal sound wave, that's the threshold. Subsequent "tones" are declared "1" or "0" if the count is higher or lower than the threshold.
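The counting part is simple enough to sketch. The Python below assumes the squared-up op-amp output has already been sampled into a list of 0/1 logic levels over equal-length windows; the sample data and function names are invented, and on a real micro this would be a hardware counter or a pulse-counting instruction rather than a loop over a list.

```python
def count_pulses(levels):
    """Count rising edges (0 -> 1) in a sequence of sampled logic levels."""
    return sum(1 for prev, cur in zip(levels, levels[1:])
               if prev == 0 and cur == 1)

def classify(count, threshold_count):
    """A count above the reference count is a 1, otherwise a 0."""
    return 1 if count > threshold_count else 0

# Three equal-length windows: the reference tone, a higher tone, a lower tone.
reference = [0, 1, 1, 0, 0, 1, 1, 0, 0, 1]
higher    = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
lower     = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1]

threshold = count_pulses(reference)
bits = [classify(count_pulses(w), threshold) for w in (higher, lower)]
print(threshold, bits)
```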

If I have time this week I'm going to try breadboarding something up. I think a Stamp can do this if it is dedicated to the task; my preference might be an SX.

Faisal -- sorry to go off on what may have been, for you, a not particularly productive tangent. I will echo Mike's comments -- it's tricky. Others at the forums have used the VR Stamp (no relation) kit with some degree of success. My impression is that programming it takes some careful planning, and it doesn't seem cheap.

My own laptop (a Mac) does a nice job of recognizing my voice (after having been trained). Many of the projects I've seen that use voice recognition (or machine vision, for that matter) seem to end up using some kind of microprocessor (i.e. a laptop or desktop PC type system) running higher-end software. Not sure it's something that can be tackled with a standalone microcontroller.
