Data search algorithm
Michael O'Doul
Posts: 19
Hi all,
This is bound to be somewhat unusual. I have a 2p40 with 1 meg of eeprom hooked up (gotta love I2C). The eeprom will be holding UPC data in the following format:
13 digit number (compressed to 2 bytes), text data, text data <---both text data fields are of variable length
I'm running a serial barcode scanner, and I want to get output of the product descriptions based on the number input from the scanner. The basic IO is straightforward; what I'd like to know is whether anyone has any experience with implementing search algorithms with this much data on a Stamp. Brute force is going to be too slow in this case.
If this works I'll add another i2c bus and another meg of eeprom.
Thanks in advance,
michael odoul
Comments
http://www.parallax.com/detail.asp?product_id=28140
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chris Savage
Parallax Tech Support
You have an EEPROM pre-loaded with UPC codes, and corresponding variable length text data.
You then want to scan a UPC code, and retrieve the text data.
Assuming this is correct, you have a challenge on your hands. I would suggest, first of all, that the data be "pre-sorted". This gets us on the right track: as you search, you can read a record and determine whether the record you are after is "forward" or "backward" in the table.
This is made much easier if you can make the fields "fixed length". Then you can use a "binary search" algorithm: read the middle record of the set. If it's "too high", read the record halfway between there and the beginning; if it's "too low", read the record halfway between there and the end. Repeat on whichever half the target must be in, and just keep narrowing down the search.
If making the records fixed length would waste too much space, the alternative is to do basically the same thing, but then you have to jump "halfway" and move forward (or backward) until you find a "Beginning of Record" marker.
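Here's a rough sketch of the fixed-length version in C, just to show the shape of it (this isn't Stamp code; the 16-byte record size, the 2-byte key at the front of each record, and the eeprom_read() helper are all assumptions on my part):

#include <stdint.h>

#define RECORD_SIZE  16u                         /* assumed fixed record length */
#define NUM_RECORDS  65536ul                     /* 1 MB / 16 bytes */

extern void eeprom_read(uint32_t addr, uint8_t *buf, uint16_t len);   /* hypothetical I2C fetch */

/* Binary search over fixed-length, pre-sorted records.
   Returns the record index holding 'key', or -1 if it isn't there. */
long find_record(uint16_t key)
{
    uint32_t lo = 0, hi = NUM_RECORDS;           /* half-open range [lo, hi) */
    uint8_t  rec[RECORD_SIZE];

    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        eeprom_read(mid * RECORD_SIZE, rec, RECORD_SIZE);
        uint16_t found = ((uint16_t)rec[0] << 8) | rec[1];   /* 2-byte key, high byte first */

        if (found == key)
            return (long)mid;                    /* hit: the text fields follow the key */
        else if (found < key)
            lo = mid + 1;                        /* target is "forward" in the table */
        else
            hi = mid;                            /* target is "backward" in the table */
    }
    return -1;                                   /* not in the table */
}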
Is this the type of scheme you're after?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
John R.
Click here to see my Nomad Build Log
We're definitely thinking along the same lines. I'd really like to avoid the fixed record length -- data efficiency is important. So your second option is what I've been mulling over while quietly hoping there was a better way. My concern is that when I do a binary 'jump' and land in the middle of a record, I then have to read byte-by-byte until I get to the next record start. It's that byte-by-byte part that scares me; it could slow the thing to a crawl. If there's no better way, though, then so be it. And if it's too slow I'll swallow hard and go to a fixed record length.
Thanks,
michael odoul
Your "byte-by-byte" might not be as slow as you think. You just need to define an EOR or BOR marker, and just incriment/decriment your index until you find it. You don't need to look at what the character is, just whether or not its the record marker. Then you can read and compare the UPC code. I'm not sure how the I2C gets the data from the EEPROM, but you might be able to help yourself by getting a buffers worth of data at a time. If the buffer is big enough, you might be able to get part (or all) of the UPC code in the "fetch". (I could be talking totally out of my butt here.) I've done binary searches on "files", but not with the STAMP, and I've never used an I2C EEPROM where I was "in control". I'm "assuming" that you send a command to fetch a range of memory and get the contents back.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
John R.
Click here to see my Nomad Build Log
I'm fiddling with the notion of giving a bit of intelligence to the search pattern. Since I can massage the data set any way I want with a PC before it goes into eeprom, I'm thinking of calculating an average for the record length, and then moving up and down through the eeprom using that number as a guide. The data file will go into eeprom pre-sorted, and even though each record will be variable length, the std deviation should be quite low, and it will be fairly easy to guess 'about where' the right record should be. This could really speed things up.
Also, since the data is presorted, I can make my first jump based on the value of the UPC itself. In other words, a UPC code like 3XXXXXXXXXXXX should be about 30% of the way through the file (the 'file' is actually going to be spread across 4 256k eeprom chips, but you get the idea), and a code like 7XXXXXXXXXXXX should be about 70% through the file, etc.
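Roughly what I have in mind, sketched in C on the PC side just to check the arithmetic (the total size, record count, and average record length below are placeholder numbers, not final values):

#include <stdint.h>

#define TOTAL_BYTES     1048576ul   /* the 1 MB image spread across the four 256K parts */
#define NUM_RECORDS     58000ul     /* placeholder record count */
#define AVG_RECORD_LEN  18          /* average record length, computed on the PC (placeholder) */

/* First jump: a key 30% of the way through the 2-byte key space
   should sit about 30% of the way through the pre-sorted image. */
uint32_t first_guess(uint16_t key)
{
    return (uint32_t)(((uint64_t)key * TOTAL_BYTES) / 65536ul);
}

/* Refining a probe: estimate how many records separate the key found at the
   probe from the target, then step by that many "average records" in bytes. */
int32_t step_estimate(uint16_t key_found, uint16_t key_target)
{
    int32_t keys_apart    = (int32_t)key_target - (int32_t)key_found;
    int32_t records_apart = (int32_t)(((int64_t)keys_apart * (int64_t)NUM_RECORDS) / 65536);
    return records_apart * AVG_RECORD_LEN;
}

If a step overshoots, I can always fall back to plain halving from there, so the worst case is still the binary search.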
As I think it through, this might not be so bad after all...
Thanks again,
michael odoul
At the expense of some data storage space, you might consider the following. Some of this will add speed to the hardware portion and some will add speed to the software portion.
From the hardware side, you might consider using FRAM rather than EEPROM, as it's significantly faster:
http://www.ramtron.com/ . In addition to that, the wait times ordinarily experienced with EEPROM will be far less.
From the hardware and software sides, use two datasets. The first is fixed length and contains the sequential DATA_KEY and appropriate sequential BLOCK_NUMBER. The second dataset is variable in length and contains (perhaps internally) BLOCK_NUMBER and the UPC_DATA. In addition, I suspect all blocks of data will need to be padded out to a full block size. Some storage devices permit block reads, which will reduce retrieval time significantly.
From the software side, now use the binary search technique on the first dataset to retrieve the appropriate BLOCK_NUMBER. Fetch the appropriate block from the second dataset. Only now will you have to do a byte-by-byte search (within the identified block) to find the exact record, rather than a byte-by-byte search through a larger set of data. Only you can determine an appropriate block size.
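A sketch of the lookup, in C. The 4-byte index entry layout, the index size, and the 256-byte block size are placeholders of mine, and eeprom_read() stands in for whatever block read the storage device provides:

#include <stdint.h>

#define INDEX_BASE   0ul                               /* start of the fixed-length index */
#define INDEX_COUNT  4096ul                            /* placeholder number of index entries */
#define ENTRY_SIZE   4ul                               /* 2-byte DATA_KEY + 2-byte BLOCK_NUMBER */
#define BLOCK_BASE   (INDEX_BASE + ENTRY_SIZE * INDEX_COUNT)
#define BLOCK_SIZE   256u                              /* placeholder block size */

extern void eeprom_read(uint32_t addr, uint8_t *buf, uint16_t len);   /* hypothetical block read */

/* Binary search the fixed-length index for 'key', then fetch its block.
   Returns 1 and fills 'block' on success, 0 if the key isn't indexed.
   The byte-by-byte record search then happens only inside 'block'. */
int fetch_block(uint16_t key, uint8_t block[BLOCK_SIZE])
{
    uint32_t lo = 0, hi = INDEX_COUNT;
    uint8_t  entry[ENTRY_SIZE];

    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        eeprom_read(INDEX_BASE + ENTRY_SIZE * mid, entry, ENTRY_SIZE);
        uint16_t k = ((uint16_t)entry[0] << 8) | entry[1];

        if (k == key) {
            uint16_t blk = ((uint16_t)entry[2] << 8) | entry[3];
            eeprom_read(BLOCK_BASE + (uint32_t)blk * BLOCK_SIZE, block, BLOCK_SIZE);
            return 1;
        } else if (k < key) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return 0;
}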
Regards,
Bruce Bates
Post Edited (Bruce Bates) : 2/10/2007 7:26:18 AM GMT
You may want to consider using SD media, and fixed-length fields.
Bean.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Cheap used 4-digit LED display with driver IC www.hc4led.com
Low power SD Data Logger www.sddatalogger.com
SX-Video Display Modules www.sxvm.com
Stuff I'm selling on ebay http://search.ebay.com/_W0QQsassZhittconsultingQQhtZ-1
"USA Today has come out with a new survey - apparently, three out of every four people make up 75% of the population." - David Letterman
Max fetch time should be < 3 seconds, and preferably < 1 second.
Also, if I move to SD, I can use the entire 40 meg UPC database, not just a digest. I would dearly love to be able to do that.
thanks,
michael odoul
Post Edited (Michael O'Doul) : 2/10/2007 1:30:05 PM GMT