Shop OBEX P1 Docs P2 Docs Learn Events
Store and look up 3,000 ZStrings - possible? — Parallax Forums

Store and look up 3,000 ZStrings - possible?

John KauffmanJohn Kauffman Posts: 653
edited 2010-11-15 11:56 in Propeller 1
For my next project I’d like to try switching from STAMP & SX to Prop. This is an art project, so a little different then the usual.

1 - Semi-random generate four letters to make a ‘word candidate’
2 - Display the word candidate on a serial LCD
3 - Check if the word candidate is in a word list. The word list consists of three thousand Zstrings with no line breaks or other characters for a total of 15,000 characters (including the zeros).
4 - If there is a match (the word candidate = one of the words in the word list), display the word on a second LCD.
(repeat)

Step 1 is not hard and the display steps (2 and 4) are easy thanks to the Object Exchange. For step 3 I expect no problem writing the loop that compares the word candidate with a given word retrieved from the word list

The challenge is the storage of the word list and retrieval of word after word to be tested. For the hub RAM, the code should fit in 16 Kb, so I think I have 16 kb left for the word list.

Questions:

Will 15,000 characters actually use 15 kb of Prop RAM? (Is there some loss in overhead or inefficiency or allowable minimum size of data units?)

Is there a limit on lines in the Prop Tool, because I have to write lines to get the word list into RAM?

Any opinions of deal breakers?
«1

Comments

  • Bobb FwedBobb Fwed Posts: 1,119
    edited 2010-09-21 17:03
    What exactly does the word list represent? Is there some way to index the data rather than storing each and every entry? Some formula to make things a bit deterministic?

    As far as I know, there is no overhead for basic stored characters. If you dump 1000 characters in a DAT section, it will take up 1000 bytes.
  • kuronekokuroneko Posts: 3,623
    edited 2010-09-21 17:05
    Just some random thoughts:
    • You could store the dictionary in an ordinary data file and simply include it, no need cluttering up your code file.
    • If you know that you only have 4 letter words, get rid of the terminator. It also helps with comparison as you compare one long with another.
    • Sort your word list and use binary search (if speed is an issue).
  • John KauffmanJohn Kauffman Posts: 653
    edited 2010-09-21 17:26
    kuroneko:
    * You could store the dictionary in an ordinary data file and simply include it, no need cluttering up your code file.
    jk- I haven't tried that. It would change the source, but still be copied into RAM, right?

    * If you know that you only have 4 letter words, get rid of the terminator. It also helps with comparison as you compare one long with another.
    jk - Right, I could just keep copying out 4 character chunks.
    Do I understand the second sentence? A long is four bytes, and that would be the size of one of the word-list words and also the size of the candidate word. So the prop could work most efficiently comparing two longs. If the terminator zero is included it would either have to be stripped to fit in a long or the comparison would require the prop to do more work to compare the one-and-one-quarter long. Right? So I would have to make a major change if the project was bumped up to five letter words.

    * Sort your word list and use binary search (if speed is an issue).
    jk - I'm hoping to of the search in under 2 seconds. The list is already sorted. I'll have to learn the ideas behind a binary search. I assume that means the code will nt have to translate between ascii and the bits.

    Bobb:
    The word list is a string of 15,000 characters divided into words separated by zeros. That can easily be changed to some other separator. The words are actual four-letter words that are found in the dictionary. It starts with: abby0ably0acne0aday0aeon0agag0...
    I can easily change the format.
  • SapiehaSapieha Posts: 2,964
    edited 2010-09-21 17:44
    Hi John Kauffman.

    BLUE
    . That DAT file need be included to match LONG boundary.

    RED. If You add 0 terminations BYTE You can't anymore READ that by COG as LONG's.

    GREEN. If You send 4 ASCII correctly to LONG that COG will compare with Yours table it is not any need for conversion (If You use 7bits ASCII code in table) and always strip 8bit of one ASCII that will be compared to that table.
    ONLY thing You need be aware as You have TABLE words and words to compare BOTH in little else big edian.

    AND as You need speed You need write one PASM program for one COG to that compare.

    kuroneko:
    * You could store the dictionary in an ordinary data file and simply include it, no need cluttering up your code file.
    jk- I haven't tried that. It would change the source, but still be copied into RAM, right?

    * If you know that you only have 4 letter words, get rid of the terminator. It also helps with comparison as you compare one long with another.
    jk - Right, I could just keep copying out 4 character chunks.
    Do I understand the second sentence? A long is four bytes, and that would be the size of one of the word-list words and also the size of the candidate word. So the prop could work most efficiently comparing two longs. If the terminator zero is included it would either have to be stripped to fit in a long or the comparison would require the prop to do more work to compare the one-and-one-quarter long. Right? So I would have to make a major change if the project was bumped up to five letter words.

    * Sort your word list and use binary search (if speed is an issue).
    jk - I'm hoping to of the search in under 2 seconds. The list is already sorted. I'll have to learn the ideas behind a binary search. I assume that means the code will nt have to translate between ascii and the bits.

    Bobb:
    The word list is a string of 15,000 characters divided into words separated by zeros. That can easily be changed to some other separator. The words are actual four-letter words that are found in the dictionary. It starts with: abby0ably0acne0aday0aeon0agag0...
    I can easily change the format.
  • kuronekokuroneko Posts: 3,623
    edited 2010-09-21 17:45
    I haven't tried that. It would change the source, but still be copied into RAM, right?
    Yes, check the file directive in the manual (around p.107). It allows you to include an external file as data.
    Do I understand the second sentence? A long is four bytes, and that would be the size of one of the word-list words and also the size of the candidate word. So the prop could work most efficiently comparing two longs. If the terminator zero is included it would either have to be stripped to fit in a long or the comparison would require the prop to do more work to compare the one-and-one-quarter long. Right? So I would have to make a major change if the project was bumped up to five letter words.
    Pros and cons left and right :) If the words become longer you'd be back to normal string compare calls. You only get the long benefit when the words are 4 letters in size and long aligned (4n addresses). Anything which disturbs alignment and is of an odd length (e.g. 5 letters) requires extra work.
    I'm hoping to of the search in under 2 seconds. The list is already sorted. I'll have to learn the ideas behind a binary search. I assume that means the code will not have to translate between ascii and the bits.

    I never tried sorting an array of this size so I wouldn't know how long it takes. Binary search starts by looking in the middle of the sorted array and either finds a match or decides to keep searching in upper/lower half until there is no entry left. For your 3000 entries that should be at most 12 comparisons (2^12 = 4096).
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2010-09-21 18:03
    One thing to be aware of: longs are stored in little-endian byte order. So you will have to sort your words in an order that treats the first character as the least significant and the last character as most significant, e.g. "aaab" > "baaa" when compared as longs.

    -Phil
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-09-21 18:07
    Re-format it with a quick program or by hand, to the following:

    DAT

    long 0 ' force long alignment
    words byte "abby","ably","acne",...

    Note that the 15000 character string takes just 12000 bytes without the separator!
    It starts with: abby0ably0acne0aday0aeon0agag0...
    I can easily change the format.
  • Bobb FwedBobb Fwed Posts: 1,119
    edited 2010-09-21 18:45
    "word" means so much in computer languages, no one uses real words!

    What about putting the words in a indexed architecture in the upper 32K of a 64K EEPROM? This would give you room to expand the dictionary in the future and you would only have to write it once (as long as the indexing scheme stays the same).

    You calculate the storage location using a simple checksum. This would give you lookups in the realm of milliseconds.

    I have a program that uses this type of schema in my "Database" program (in my signature -- if you want to get a start on it). You can't use my program because it stores lookup tables in RAM that would exhaust too much space, but you could adapt it so these tables are not necessary, especially since you would be doing read-only.
  • Dave HeinDave Hein Posts: 6,347
    edited 2010-09-21 19:08
    So basically, you have 3000 32-bit numbers. A binary search will require only 12 comparisons. You should be able to do this in Spin without much trouble.
  • John KauffmanJohn Kauffman Posts: 653
    edited 2010-09-21 19:46
    Bobb:

    Good point on multiple uses of the word 'word.'

    I looked at the description of your database object a few days ago and thought that storing the list in EEPROM would be too slow for the search. But would that be true? Do you have a rough feel for how long it takes to read a long from your table in EEPROM?
    THanks.
  • AribaAriba Posts: 2,690
    edited 2010-09-21 19:56
    You can do it with longs, but it is not much more complicated with strings.
    Here is a little (untested) Spin code with examples for string search in a list and generating a random word.

    If your list is really a long ASCII string in the form:
    abby0ably0acne0aday0aeon0agag0...
    then the 0-characters must be replaced with zero-bytes to make valid strings. This is made by the first part in the main routine, and only once at startup.
    PUB main  | i, rnd
    
     repeat i from 0 to 14999            'replace ascii '0' with 0-byte
       if list[i] == "0"
         list[i] := 0
    
     repeat
       repeat i from 0 to 14999 step 5   'search word in list
         if strcomp(@rndword, @list[i])
           'code for match here
           quit
    
       repeat i from 0 to 3              'new random word
         rnd?
         rndword[i] := ||rnd // 26 + "a"   'only letters a..z
       
    DAT
    rndword  byte "abcd",0               'random word
    list     file "wordlist.txt"         'include word list (=byte array)
    

    Andy
  • Bobb FwedBobb Fwed Posts: 1,119
    edited 2010-09-21 19:58
    That's the beauty of indexing, it doesn't "search" per se, it gets an exact location (or at worst, a rough idea) of where the information is, then reads the entry (or at worst a few entries) to make sure it is what you were looking for.

    And because you are doing read-only, and you know all entries are identical in size and type, you can optimize your checksum value for greater speed, and you won't need multiple lengthy tables like I did.
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2010-09-21 20:09
    The problem with using strcomp is that it only returns match or no-match. So it's not possible to use it in a binary search. I think doing a compare on longs is, by far, the most efficient search method, assuming the dictionary is sorted as I've indicated.

    -Phil
  • Bobb FwedBobb Fwed Posts: 1,119
    edited 2010-09-21 20:57
    This program assumes the 4-letter words are broken down into a long integer (though 4-byte sting is about the same thing, but would need different comparison in the program).

    When selecting the beginning hash value, you could play around with it (or write a script) to try to space the words out as much as possible, that will speed this up a little bit.
    If the words were perfectly spread out (very unlikely to be possible), the program would only have to read one long in EEPROM to confirm if the word exists in the dictionary or not.
    In a more likely scenario, most times only one long would need to be read, then periodically two, three, or four longs would be read in succession before a blank cell was read. If any of the longs read matches the desired value, it stops reading the EEPROM.

    Something like this would likely suffice (untested):
    CON
    
      start_table = $8000
      table_size  = $8000             ' size of table (in longs), maybe this should be $7FFF
    
    OBJ
    
      i2c   : "Basic_I2C_Driver"    ' for read/write operations to EEPROM (modified version)
    
    PUB init
    '' initiate i2c object
    
      i2c.Initialize(i2c#BootPin)                                                   ' start i2c object
    
    PUB checkword (nameAddr, expected_value) | val
    '' cycle through word's address until blank cell is found, if one of the cell match the expected value, return true
    
      REPEAT WHILE (val := read(gethash(nameAddr)))                                 ' easy/expensive way to check value
        IF (expected_value == val)
          RETURN true                                                               ' if current cell matches, return true                             
          
      RETURN false                                                                  ' if empty cell is found, and expected value has not been found yet, return false
    
    PUB read (addr) : val
    '' read page from EEPROM, return address of value
                              
      IF i2c.ReadPage(i2c#EEPROM, addr, @val, 4)                                    ' read a long at this address
        ABORT                     
    
      RETURN @val                                                                   ' return value found at address
    
    PUB gethash (nameAddr) : hash | i, length 
    ' return a simple checksum within the confines of table_size                  
    
      length := strsize(nameAddr) - 1                                               ' could force this to a specific length to speed things up (like 3)
    
      hash := $55555555                                                             ' alternating 1s and 0s (you can change this to whatever, just make it consistant for forever)
      REPEAT i FROM 0 TO length
        hash += byte[nameAddr + i] + hash << 5                                      ' cycle through input adding to hash value
    
      hash &= table_size << 2 + start_table                                         ' return a long address (lowest value is table start) 
    
    I can mess with this some more tomorrow. I may be able to move from this mock program to a working one without much effort.
  • AribaAriba Posts: 2,690
    edited 2010-09-21 21:05
    I just tested the string compare from my previous post.
    It takes only 78ms to search thru the whole list, without finding a match (with 80 MHz clock).

    Andy
  • KaosKiddKaosKidd Posts: 296
    edited 2010-09-22 07:28
    Some things to consider:
    #1: Will the list be added to or modified? If so, the resorting (or re indexing) is going to be required.
    #2: An Indexed approach requires more storage space to hold the index.
    #3: The data set is fixed; thus a search can be highly optimized knowing this.

    In a world where all things are considered equal:
    If the data is going to be static; not changing; and the data is stored in the same kind of RAM (HUB or EEPROM), and the data is "managed data", meaning it's sorted, and all elements are the same length, and the number of items being searched is small (3000 is small in this case) then all three possible searches are feasible:

    Brute Search: Starting at the top and moving down.
    Binary Search: Test the middle, divide in 1/2 and test again.
    Indexed Query: Read an index, get a data pointer, then test the data.

    The only reason this is the case is because the data set is so small and the propeller is fast. A binary search in PASM would be FAST.

    Just a humble opinion here.
    KK
  • John KauffmanJohn Kauffman Posts: 653
    edited 2010-09-22 12:57
    thanks, guys, especially Bobb for that working code.
    Ariba - thanks for speed test, I was hoping < 1 second, so that is great news.
    It looks like I have multiple options. I'll start with the easiest since it is my first prop project
  • Bobb FwedBobb Fwed Posts: 1,119
    edited 2010-09-22 13:32
    thanks, guys, especially Bobb for that working code.
    Ariba - thanks for speed test, I was hoping < 1 second, so that is great news.
    It looks like I have multiple options. I'll start with the easiest since it is my first prop project

    I wasn't done yet. I made a working system, it writes, it reads, it cleans your windows without leaving streaks, it does it all!

    It checks and returns if the word is found or not in about 3.8ms. All without using any VAR block. And (including the I2C object) is only 200 longs of RAM space.

    Here is a fairly complete system:
    CON
    
      { ==[ CLOCK SET ]== }       
      _CLKMODE      = XTAL1 + PLL16X
      _XINFREQ      = 5_000_000
    
      start_table = $1000            ' you will want to start at begeinning of upper 32K in EEPROM ($8000 I do believe), must be a "long address" (a multiple of 4)
      table_size  = $FFF             ' size of table (in longs), this should be $1FFF (for entire 32K), must be an exponent of 2, minus 1 (2^7 - 1 = $7F, 2^10 - 1 = $3FF, 2^14 - 1 = 1FFF, etc.)
    
    OBJ
    
      i2c   : "Basic_I2C_Driver2"    ' for read/write operations to EEPROM (modified version)
    
      DEBUG  : "FullDuplexSerial" 
    
    PUB Start | i, j       
    
      i2c.Initialize(i2c#BootPin)                                                   ' start i2c object
    
      DEBUG.start(31, 30, 0, 57600)
      waitcnt(clkfreq + cnt)  
      DEBUG.tx($D)
    
    
      DEBUG.str(string("Is the value found? "))
      DEBUG.dec(checkword(string("ones")))
      DEBUG.str(string($D,"0 = NO, -1 = YES",$D))
                      
      '' the below commented code will go through everything under dictionary in DAT to put in indexed location in EEPROM
      {{i := @dictionary
      DEBUG.str(string("Starting writing....",$D)
      REPEAT WHILE (long[i])
        putword(i, long[i])                                                         ' insert word into index space
        
        DEBUG.str(string("Write..."))',$D))                                         ' remove this DEBUG stuff to speed up adding stuff to indexing
        REPEAT j FROM 0 TO 3                                                        ' remove this DEBUG stuff to speed up adding stuff to indexing
          DEBUG.tx(byte[i + j])                                                     ' remove this DEBUG stuff to speed up adding stuff to indexing
        DEBUG.tx($D)                                                                ' remove this DEBUG stuff to speed up adding stuff to indexing
          
        i += 4
      DEBUG.str(string("Done.",$D)) }}  
    
      DEBUG.tx($D)
    
      repeat
        waitcnt(0)
        
    PUB checkword (nameAddr) | val, i, expected_value
    '' cycle through word's address until blank cell is found, if one of the cell match the expected value, return true
    
      expected_value := convertWordToInt(nameAddr)                                  ' get expected LONG
      i := gethash(nameAddr)                                                        ' starting address to search at
      REPEAT WHILE (val := read(i)) > 0                                             ' easy/expensive way to check value
        IF (expected_value == val)                                                  ' check indexed value
          RETURN true                                                               ' if current cell matches, return true
        i += 4                                                                      ' move to next long                   
          
      RETURN false                                                                  ' if empty cell is found, and expected value has not been found yet, return false
    
    {{PUB putword (nameAddr, value) | addr
    '' put a long into EEPROM at hash location
    
      addr := gethash(nameAddr)                                                     ' start at exact hash address
      REPEAT WHILE (read(addr))                                                     ' keep going until empty cell is found
        addr += 4                                                                   ' move one long
      write(addr, @value)                                                           ' write in empty cell
    }}  
    PRI convertWordToInt (nameAddr) : val | i
    '' transform 4 letter word into a long integer
                                                                                                    
      REPEAT i FROM 3 TO 0                                                          ' go through 4 characters of string
        val := val << 8 + byte[nameAddr + i]                                        ' put each character in a byte of the LONG
    
    {{PRI write (addr, valueAddr) | time
    '' write page to EEPROM with watchdog  
    
      IF i2c.WritePage(i2c#EEPROM, addr, valueAddr, 4)
        RETURN false
      time := cnt
      REPEAT WHILE i2c.WriteWait(i2c#EEPROM, addr)                                  ' wait for watchdog
        IF (cnt - time > clkfreq / 10)
          RETURN false
    
      RETURN true 
    }} 
    PRI read (addr) : val
    '' read page from EEPROM, return address of value
                              
      IF (i2c.ReadPage(i2c#EEPROM, addr, @val, 4))                                  ' read a long at this address
        RETURN false                     
    
      RETURN val                                                                    ' return value found at address
    
    PRI gethash (nameAddr) : hash | i, length 
    ' return a simple checksum within the confines of table_size                  
    
      hash := $55555555                                                             ' alternating 1s and 0s (you can change this to whatever, just make it consistant for forever)
      REPEAT i FROM 0 TO 3
        hash += byte[nameAddr + i] + hash << 5                                      ' cycle through input adding to hash value
    
      hash &= table_size << 2                                                       ' return a long address (lowest value is table start)
      hash += start_table
    
    DAT
    
    {{dictionary              byte    "oven","abbr","exit","unit","move"
                            byte    "only","list","then","line","prop"
                            byte    "deep","seal","will","loss","bass"
                            byte    "that","best","tell","heal","your"
                            byte    "back","data","pain","each","week"
                            byte    "weak","four","days","ones","knot"
                            byte    "mode","auto","with",0,0,0,0    
    }}
    
    All the commented out code is only needed to write the index. Once it is written to EEPROM, you shouldn't need to touch it again. There is no need to organize the word list other than into byte blocks. After the very last one, you need a long of 0s.
  • John KauffmanJohn Kauffman Posts: 653
    edited 2010-11-13 22:28
    Bobb - sorry I did not respond more quickly, my email alert was not on.

    This looks great. Much of the coding is above my skills, but maybe not too far. This will be a good lesson in some new techniques, commands and operators for me.

    My effort so far was to convert the dictionary to bytes externally and then just have the longs in DAT. I see yours does it on the fly which makes it easier for human to work with the dictionary.

    I also built a comparison PUB which works, but not as elegantly as yours.

    More comments to come. Again, much thanks.
  • tonyp12tonyp12 Posts: 1,951
    edited 2010-11-14 07:00
    If you stick to just using a-z (or A-Z) when you would need only
    5bits (32 possible values) for each charter.
    A little harder to decode, but not impossible.
  • John KauffmanJohn Kauffman Posts: 653
    edited 2010-11-14 10:51
    Here is what I don't get.

    Why use a hash? I think of that term as a functional box that holds an index. A value comes into the hash function and returned is the DAT address of the equivalent value. So it is needed when a random value needs to find its equivalent.

    But in this case we will be using every value in DAT for the compare, at least until we find a match. We can start the compare at the first long of the table and go to the last. So we don't need a look-up hash function to find the address of interest in the DAT table - the address of interest is the next long.

    Why not just populate the table and have the CheckWord function start at the beginning of the table and go to the end, without a hash function?

    Much thanks, this is a great course of study for me.
  • Bobb FwedBobb Fwed Posts: 1,119
    edited 2010-11-14 13:02
    The way indexing works is (optimal) only one value is ever compared. what my program does is store the values into EEPROM into addresses determined by the hash. the hash is a checksum integer that is based on the value being stored. thus each stored value is in a different location. when you want to retrieve the value you check the expected address. if it doesn't exist there, then the value doesn't exist. if a value is at that address, it checks to make sure it is the correct value (hash collisions are possible, and likely), if its not the correct one, it checks the next address until it runs into a blank value which should be very soon (this is the easiest storage scheme, but not the fastest -- but still way faster than brute force). If your table is less than 50% populated, there is a very good chance that only one address is read and checked.
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2010-11-14 16:46
    Avoiding collisions depends a lot on the hash function chosen and on the domain over which it operates. Since a hash function is inherently a many-to-one mapping, it's seldom easy to find one that populates its range evenly with few collisions, especially when memory is limited.

    -Phil
  • John KauffmanJohn Kauffman Posts: 653
    edited 2010-11-14 17:35
    That makes sense. Let me put it in my own words to see if I have it right:

    1. Hash background info:
    1.1. A hash value can be assigned to a string
    1.2. The hash value is not obvious to a human eye (for example, not alphabetical)
    1.3. A hash value is reproducible- the same hash algorythym will always produce the same hash value from the same string
    1.4. Most hash values are unique, but code most have provision for non-unique

    2. When dictionary strings are stored:
    2.1. The dictionary string is examined and assigned a hash value
    2.2. The dictionary strings are put in the memory location that is the memory start + their hash
    2.3. The locations will be, roughly in the order of their hash values
    2.4. The order of the dictionary strings will be according to the hash, so the order will make little sense to a human eye
    2.5. If a dictionary string is about to be saved and its hash is not unique, then the dictionary string is saved in the next available memory space. In theory, if non-uniques are rare, that kind of adjustment should be infrequent and should find an open space within the next few slots

    3. When a candidate string is checked
    3.1. The candidate string is assigned a hash
    3.2. Since hashing is reproducible, it will have a hash the same as a dictionary hash if there is a matching word
    3.3. The comparison goes to the memory address of the candidate hash
    3.3.1. If the address is empty then there is no match of hashes and so the candidate does not have a match to a dictionary string.
    3.3.2. If the address is full there are two possiblities.
    3.3.2.1. The dictionary string must have the same hash as the candidate and thus they must be the same
    3.3.2.2. There is the coincidence of two strings with the same hash
    3.4. to address the possibility of a duplicate hash, the string is rad and compared on a higher (non-hash) level
    3.4.1. IF there is a match then the candidate string equals a dictionary string
    3.4.2. If there is not a match then a match might be in the next few addresses, so those are checked for hash equal, and if that passes, the high level check

    4. Advantages:
    4.1. The comparison can go straight to the address of a matching hash and begin testing. There is no need to start testing with memory address zero and going through all 3,000 possibilities

    Question1:
    If my dictionary takes up almost all of the available memory, during the save of the last strings, won’t the memory be almost full so the last few dictionary strings to get stored will have to move later by many addresses before they find an empty address? Perhaps even move to the end of the memory and not find an empty address at all? And if that is true, then the code that compares would always have to check from the 'ideal' location to the end.

    Q2:
    With 3,000 dictionary entries, what are the chances of two hashes being the same? I guess a good clue would be the number of significant digits in the hash value. And how well does the hash differentiate between two similar strings, 'similar' being in the characteristics that the hash considers important?

    Again, much thanks.
  • Bobb FwedBobb Fwed Posts: 1,119
    edited 2010-11-14 19:23
    It looks like you have the process understood pretty well. As for your questions. Yes it is possible that there are many addresses tested before finding an empty space. But this number will always be less than a brute force 0-3000 check. 3000 entries (12000 bytes) split into 32768 byte table (I think I remember you saying you would use the upper 32K of a 64K EEPROM) is less than half. So perfect hashing would produce just one read per test, but more likely, a small percentage will have to read a couple entries. And a very small percentage will have to read a few entries.

    This brings up something though, I don't remember if I put in a test to wrap back to address 0 if maximum memory value is reached. I will have to check this tomorrow.
  • John KauffmanJohn Kauffman Posts: 653
    edited 2010-11-14 19:25
    A thought on the later values not having an empty space.

    I thought I might have to write a program that went through and condensed the dictionary so that there are no empty addresses and the hash values are in perfect order.

    An idea:
    The dictionary will be static and can come from a text file.
    What if I use Excel to apply the same hash algorithm to the dictionary words? Then I can sort those hashes in Excel and save that dictionary in a text file. I can use that file of 'clean' hashes for SPIN to read into memory in serial order.
    That way I can be sure that the dictionary read into EEPROM has:
    - the dictionary entries are in order by their hash
    - duplicate hash values will be adjacent (not many spaces later where they found a hole)
    - no gaps in address
    - no chance that a late-arriving hash value could not find a place

    Q:
    Is it possible to replicate the hash equation in Excel perfectly enough that the Excel hash on a dictionary word will match a SPIN hash on a candidate word?

    Thanks.
  • Bobb FwedBobb Fwed Posts: 1,119
    edited 2010-11-14 19:28
    Q:
    Is it possible to replicate the hash equation in Excel perfectly enough that the Excel hash on a dictionary word will match a SPIN hash on a candidate word?
    Possible, yes, but it would take some work since the checksum is all based on bit values of characters, both are kind of annoying to work with in excel.
  • John KauffmanJohn Kauffman Posts: 653
    edited 2010-11-14 19:28
    I understand the wrap-around. That prevents "no room at the inn" for the last entries to be added.

    I will try the wrap-around as a problem to try solving.
  • Nick MuellerNick Mueller Posts: 815
    edited 2010-11-15 01:47
    Hashing only wastes a lot of storage.
    Binary search is fast and can use the most natural and most compact storage.
    It takes a maximum of 12 compares to find the string with binary search in 3000 entries.


    Nick
  • John KauffmanJohn Kauffman Posts: 653
    edited 2010-11-15 06:16
    Nick:
    So to do a binary compare, the dictionary values would have to be in order, right?
    Then compare candidate to dictionary value at middle of memory, determine if candidate is higher or lower, then go to midpoint of that half, etc.
    Thanks.
Sign In or Register to comment.