Store and look up 3,000 ZStrings - possible?
John Kauffman
Posts: 653
For my next project I'd like to try switching from STAMP & SX to Prop. This is an art project, so a little different than the usual.
1 - Semi-random generate four letters to make a word candidate
2 - Display the word candidate on a serial LCD
3 - Check if the word candidate is in a word list. The word list consists of three thousand Zstrings with no line breaks or other characters for a total of 15,000 characters (including the zeros).
4 - If there is a match (the word candidate = one of the words in the word list), display the word on a second LCD.
(repeat)
Step 1 is not hard, and the display steps (2 and 4) are easy thanks to the Object Exchange. For step 3 I expect no problem writing the loop that compares the word candidate with a given word retrieved from the word list.
The challenge is the storage of the word list and the retrieval of word after word to be tested. Hub RAM is 32 KB; the code should fit in 16 KB, so I think I have 16 KB left for the word list.
Questions:
Will 15,000 characters actually use 15 KB of Prop RAM? (Is there some loss to overhead, inefficiency, or a minimum allowable size of data units?)
Is there a limit on the number of lines in the Prop Tool? I'll have to write a lot of lines to get the word list into RAM.
Any opinions on deal breakers?
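For step 1, a minimal (untested) Spin sketch could look like this; the seed handling is an assumption, e.g. seeding from cnt at startup:

VAR
  long seed
  byte candidate[5]

PUB Init
  seed := cnt                               ' seed the PRNG from the system counter

PUB MakeCandidate | i
  repeat i from 0 to 3
    candidate[i] := "a" + ||(?seed) // 26   ' pseudo-random lowercase letter
  candidate[4] := 0                         ' zero-terminate for the LCD driver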
Comments
As far as I know, there is no overhead for basic stored characters. If you dump 1000 characters in a DAT section, it will take up 1000 bytes.
* You could store the dictionary in an ordinary data file and simply include it - no need to clutter up your code file.
jk - I haven't tried that. It would change the source, but still be copied into RAM, right? (See the first sketch after this exchange.)
* If you know that you only have 4 letter words, get rid of the terminator. It also helps with comparison as you compare one long with another.
jk - Right, I could just keep copying out 4-character chunks.
Do I understand the second sentence? A long is four bytes, and that would be the size of one of the word-list words and also the size of the candidate word. So the Prop could work most efficiently comparing two longs. If the terminator zero is included, it would either have to be stripped to fit in a long, or the comparison would require the Prop to do more work to compare the one-and-one-quarter longs. Right? So I would have to make a major change if the project was bumped up to five-letter words. (See the second sketch after this exchange.)
* Sort your word list and use binary search (if speed is an issue).
jk - I'm hoping to do the search in under 2 seconds. The list is already sorted. I'll have to learn the ideas behind a binary search. I assume that means the code will not have to translate between ASCII and the bits.
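On the data-file suggestion: the Propeller Tool's FILE directive pulls a binary file's raw bytes into a DAT section at compile time, so the bytes still end up in the downloaded image and occupy hub RAM. A sketch (the filename is hypothetical):

DAT
wordlist  file "wordlist.bin"    ' raw file bytes are compiled straight into the image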
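And a sketch of the long-compare loop from the no-terminator suggestion, assuming 3,000 terminator-free 4-byte entries stored long-aligned at a label called words (the names and count are assumptions, and it's untested):

CON
  WORDCOUNT = 3000               ' number of 4-byte dictionary entries

PUB InList(pCand) : match | c, i
  c := long[pCand]               ' candidate's 4 letters read as one long (buffer must be long-aligned)
  repeat i from 0 to WORDCOUNT - 1
    if long[@words][i] == c      ' one long compare per dictionary entry
      return true
                                 ' result defaults to false (0) if nothing matched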
Bobb:
The word list is a string of 15,000 characters divided into words separated by zeros. That can easily be changed to some other separator. The words are actual four-letter words that are found in the dictionary. It starts with: abby0ably0acne0aday0aeon0agag0...
I can easily change the format.
BLUE: The included DAT file needs to be aligned to a LONG boundary.
RED: If you add 0-terminator bytes, the cog can no longer read the entries as longs.
GREEN: If you pack the 4 ASCII characters into a long the same way the table entries are packed, no conversion is needed (provided the table uses 7-bit ASCII codes, and you always strip the 8th bit of each character before comparing it to the table).
The ONLY thing you need to be aware of is that the table words and the words to compare must BOTH use the same byte order, little-endian or big-endian.
AND since you need speed, you should write a PASM program so one cog can do the comparison.
Pros and cons left and right. If the words become longer, you'd be back to normal string-compare calls. You only get the long benefit when the words are 4 letters in size and long-aligned (4n addresses). Anything which disturbs alignment and is of an odd length (e.g. 5 letters) requires extra work.
I never tried sorting an array of this size, so I wouldn't know how long it takes. Binary search starts by looking in the middle of the sorted array and either finds a match or decides to keep searching in the upper/lower half until there is no entry left. For your 3000 entries that should be at most 12 comparisons (2^12 = 4096).
-Phil
DAT
long 0 ' force long alignment
words byte "abby","ably","acne",...
Note that the 15,000-character string takes just 12,000 bytes without the separators!
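A sketch of the binary search mentioned above, over a table laid out like that DAT block (untested; WORDCOUNT = 3000 is an assumption). One caveat: the compare below is numeric, so the table must be sorted by the packed long values, which on the little-endian Propeller is not the same as alphabetical order of the strings.

PUB BinFind(pCand) : match | lo, hi, mid, t, c
  c := long[pCand]               ' packed candidate
  lo := 0
  hi := WORDCOUNT - 1
  repeat while lo =< hi
    mid := (lo + hi) >> 1        ' probe the middle of the live range
    t := long[@words][mid]
    if t == c
      return true                ' found: at most ~12 probes for 3000 entries
    elseif t < c
      lo := mid + 1              ' candidate sorts above the middle entry
    else
      hi := mid - 1              ' candidate sorts below the middle entry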
What about putting the words in an indexed architecture in the upper 32K of a 64K EEPROM? This would give you room to expand the dictionary in the future and you would only have to write it once (as long as the indexing scheme stays the same).
You calculate the storage location using a simple checksum. This would give you lookups in the realm of milliseconds.
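Not the Database object's actual math, but a sketch of the kind of checksum that could map a 4-letter word to a table slot (the multiplier and slot count are arbitrary assumptions):

CON
  SLOTS = 8192                   ' 8192 longs = 32 KB, the upper half of a 64K EEPROM

PUB Slot(pWord) : s | i
  repeat i from 0 to 3
    s := s * 31 + byte[pWord][i] ' simple multiplicative checksum of the 4 letters
  s //= SLOTS                    ' fold into the table range
                                 ' the EEPROM byte address would then be $8000 + s*4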
I use this type of scheme in my "Database" program (in my signature -- if you want to get a start on it). You can't use my program as-is because it stores lookup tables in RAM that would use up too much space, but you could adapt it so these tables are not necessary, especially since you would be doing read-only.
Good point on multiple uses of the word 'word.'
I looked at the description of your database object a few days ago and thought that storing the list in EEPROM would be too slow for the search. But would that be true? Do you have a rough feel for how long it takes to read a long from your table in EEPROM?
Thanks.
Here is a little (untested) Spin code with examples for string search in a list and generating a random word.
If your list is really a long ASCII string in the form:
abby0ably0acne0aday0aeon0agag0...
then the 0-characters must be replaced with zero-bytes to make valid strings. This is done by the first part of the main routine, and only once at startup.
Andy
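The one-time conversion Andy describes could be as simple as this sketch (assuming the raw list sits at a DAT label wordlist and is the 15,000 characters given above):

CON
  LISTLEN = 15_000               ' total characters in the raw list

PUB FixSeparators | p
  repeat p from @wordlist to @wordlist + LISTLEN - 1
    if byte[p] == "0"
      byte[p] := 0               ' ASCII "0" separator becomes a real string terminator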
And because you are doing read-only, and you know all entries are identical in size and type, you can optimize your checksum value for greater speed, and you won't need multiple lengthy tables like I did.
-Phil
When selecting the beginning hash value, you could play around with it (or write a script) to try to space the words out as much as possible; that will speed this up a little bit.
If the words were perfectly spread out (very unlikely to be possible), the program would only have to read one long in EEPROM to confirm if the word exists in the dictionary or not.
In a more likely scenario, most times only one long would need to be read, then periodically two, three, or four longs would be read in succession before a blank cell was read. If any of the longs read matches the desired value, it stops reading the EEPROM.
Something like this would likely suffice (untested): I can mess with this some more tomorrow. I may be able to move from this mock program to a working one without much effort.
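That attachment isn't reproduced here, but purely as a generic illustration of the probing just described (ReadLongEE, Slot, and SLOTS are hypothetical stand-ins, not the poster's code):

PUB Lookup(pWord) : match | s, c, v
  c := long[pWord]               ' packed candidate word
  s := Slot(pWord)               ' ideal slot from the hash
  repeat
    v := ReadLongEE($8000 + s*4) ' hypothetical EEPROM read helper
    if v == 0
      return false               ' empty cell: word is not in the dictionary
    if v == c
      return true                ' candidate matches a stored word
    s := (s + 1) // SLOTS        ' probe the next slot, wrapping at the table end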
It takes only 78 ms to search through the whole list without finding a match (with an 80 MHz clock).
Andy
#1: Will the list be added to or modified? If so, re-sorting (or re-indexing) is going to be required.
#2: An indexed approach requires more storage space to hold the index.
#3: The data set is fixed; thus a search can be highly optimized knowing this.
In a world where all things are considered equal:
If the data is static (not changing), stored in the same kind of memory (hub RAM or EEPROM), and is "managed data", meaning it's sorted and all elements are the same length, and the number of items being searched is small (3000 is small in this case), then all three possible searches are feasible:
Brute Search: Starting at the top and moving down.
Binary Search: Test the middle, divide in 1/2 and test again.
Indexed Query: Read an index, get a data pointer, then test the data.
The only reason this is the case is because the data set is so small and the Propeller is fast. A binary search in PASM would be FAST.
Just a humble opinion here.
KK
Ariba - thanks for the speed test. I was hoping for < 1 second, so that is great news.
It looks like I have multiple options. I'll start with the easiest, since it is my first Prop project.
I wasn't done yet. I made a working system, it writes, it reads, it cleans your windows without leaving streaks, it does it all!
It checks and returns whether the word is found or not in about 3.8 ms. All without using any VAR block. And (including the I2C object) it takes only 200 longs of RAM.
Here is a fairly complete system. All the commented-out code is only needed to write the index. Once it is written to EEPROM, you shouldn't need to touch it again. There is no need to organize the word list other than into byte blocks. After the very last one, you need a long of 0s.
This looks great. Much of the coding is above my skills, but maybe not too far. This will be a good lesson in some new techniques, commands and operators for me.
My effort so far was to convert the dictionary to bytes externally and then just have the longs in DAT. I see yours does it on the fly, which makes it easier for a human to work with the dictionary.
I also built a comparison PUB which works, but not as elegantly as yours.
More comments to come. Again, much thanks.
5 bits (32 possible values) for each character.
A little harder to decode, but not impossible.
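For illustration, packing four 5-bit letter codes (a=0 .. z=25) into the low 20 bits of a long could look like this untested sketch:

PUB Pack5(pWord) : packed | i
  repeat i from 0 to 3
    packed := (packed << 5) | (byte[pWord][i] - "a")   ' 5 bits per letter, first letter in the top bits

PUB Unpack5(packed, pBuf) | i
  repeat i from 3 to 0                                 ' peel letters off the low end, last letter first
    byte[pBuf][i] := "a" + (packed & $1F)
    packed >>= 5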
Why use a hash? I think of that term as a functional box that holds an index. A value comes into the hash function, and what is returned is the DAT address of the equivalent value. So it is needed when a random value needs to find its equivalent.
But in this case we will be using every value in DAT for the compare, at least until we find a match. We can start the compare at the first long of the table and go to the last. So we don't need a look-up hash function to find the address of interest in the DAT table - the address of interest is the next long.
Why not just populate the table and have the CheckWord function start at the beginning of the table and go to the end, without a hash function?
Much thanks, this is a great course of study for me.
-Phil
1. Hash background info:
1.1. A hash value can be assigned to a string
1.2. The hash value is not obvious to a human eye (for example, not alphabetical)
1.3. A hash value is reproducible - the same hash algorithm will always produce the same hash value from the same string
1.4. Most hash values are unique, but the code must have provision for non-unique ones
2. When dictionary strings are stored:
2.1. The dictionary string is examined and assigned a hash value
2.2. The dictionary strings are put in the memory location that is the memory start + their hash
2.3. The locations will be roughly in the order of their hash values
2.4. The order of the dictionary strings will be according to the hash, so the order will make little sense to a human eye
2.5. If a dictionary string is about to be saved and its hash is not unique, then the dictionary string is saved in the next available memory space. In theory, if non-uniques are rare, that kind of adjustment should be infrequent and should find an open space within the next few slots
3. When a candidate string is checked
3.1. The candidate string is assigned a hash
3.2. Since hashing is reproducible, it will have a hash the same as a dictionary hash if there is a matching word
3.3. The comparison goes to the memory address of the candidate hash
3.3.1. If the address is empty then there is no match of hashes and so the candidate does not have a match to a dictionary string.
3.3.2. If the address is full there are two possibilities.
3.3.2.1. The dictionary string must have the same hash as the candidate and thus they must be the same
3.3.2.2. There is the coincidence of two strings with the same hash
3.4. To address the possibility of a duplicate hash, the string is read and compared on a higher (non-hash) level
3.4.1. If there is a match then the candidate string equals a dictionary string
3.4.2. If there is not a match then a match might be in the next few addresses, so those are checked for hash equal, and if that passes, the high level check
4. Advantages:
4.1. The comparison can go straight to the address of a matching hash and begin testing. There is no need to start testing at memory address zero and go through all 3,000 possibilities. (A sketch of the scheme follows this list.)
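A minimal sketch of the insert side described in 2.2 and 2.5, using a hub-RAM table purely for illustration (Slot is a checksum like the one sketched earlier; SLOTS is an assumption). A stored value of 0 can mark an empty cell, since no long packed from four ASCII letters is zero:

CON
  SLOTS = 8192                   ' table size (assumption)

VAR
  long table[SLOTS]

PUB Insert(c) | s
  s := Slot(@c)                  ' ideal slot from the checksum of the packed word
  repeat while long[@table][s]   ' 2.5: slot already occupied
    s := (s + 1) // SLOTS        ' step to the next slot, wrapping at the end
  long[@table][s] := c           ' store in the first empty slot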
Question1:
If my dictionary takes up almost all of the available memory, then during the save of the last strings won't the memory be almost full, so the last few dictionary strings to be stored will have to move later by many addresses before they find an empty one? Perhaps even reach the end of the memory and not find an empty address at all? And if that is true, then the code that compares would always have to check from the 'ideal' location to the end.
Q2:
With 3,000 dictionary entries, what are the chances of two hashes being the same? I guess a good clue would be the number of significant digits in the hash value. And how well does the hash differentiate between two similar strings, 'similar' being in the characteristics that the hash considers important?
Again, much thanks.
This brings up something, though: I don't remember if I put in a test to wrap back to address 0 if the maximum memory value is reached. I will have to check this tomorrow.
I thought I might have to write a program that went through and condensed the dictionary so that there are no empty addresses and the hash values are in perfect order.
An idea:
The dictionary will be static and can come from a text file.
What if I use Excel to apply the same hash algorithm to the dictionary words? Then I can sort those hashes in Excel and save that dictionary in a text file. I can use that file of 'clean' hashes for SPIN to read into memory in serial order.
That way I can be sure that the dictionary read into EEPROM has:
- the dictionary entries are in order by their hash
- duplicate hash values will be adjacent (not many spaces later where they found a hole)
- no gaps in address
- no chance that a late-arriving hash value could not find a place
Q:
Is it possible to replicate the hash equation in Excel perfectly enough that the Excel hash on a dictionary word will match a SPIN hash on a candidate word?
Thanks.
I will try the wrap-around as a problem to try solving.
Binary search is fast and can use the most natural and most compact storage.
It takes a maximum of 12 compares to find the string with binary search in 3000 entries.
Nick
So to do a binary compare, the dictionary values would have to be in order, right?
Then compare candidate to dictionary value at middle of memory, determine if candidate is higher or lower, then go to midpoint of that half, etc.
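A concrete trace: with 3,000 sorted entries at indices 0..2999, the first probe is index 1499; if the candidate sorts higher, the live range becomes 1500..2999 and the next probe is 2249; if lower, it becomes 0..1498 and the next probe is 749. Each probe halves the remaining range, so after at most 12 probes (2^12 = 4096 > 3000) you have either found the word or run out of range.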
Thanks.