More NOOB questions, reader beware
average joe
My past --n-- attempts at building a cache driver for the Touchburger have failed and I'm starting to pull my hair out. The read and write commands are directly from working code. I must be missing something!
The latest attempt: I downloaded a fresh copy of the Skeleton JCACHE external RAM driver from Google Code and renamed it touch_cache.spin in a new folder. The cache test is reused, although I doubt that's the problem. I have code for CACHE SIZE = 8192 and 4096.
The problem seems to be that address bit 13 is aliasing. I think the hardware is correct; I might reconnect the logic analyzer and grab a few screenshots if that would help. I remember having a problem like this before and I can't remember what it was!
Here's the full cache.spin. Hopefully you guys can help before I ragequit again!
{
  Skeleton JCACHE external RAM driver
  Copyright (c) 2011 by David Betz

  Based on code by Steve Denson (jazzed)
  Copyright (c) 2010 by John Steven Denson

  Inspired by VMCOG - virtual memory server for the Propeller
  Copyright (c) February 3, 2010 by William Henning

  For the TouchBurger 3 Board - DateCode AUG2012
  By James Moxham and Joe Heinz

  Basic port by Joe Heinz, Optimizations influenced by Steve Denson *huge thanks for V1
  Honorable mention to David Betz for his patience during V1
  Copyright (c) 2012 by John Steven Denson and Joe Heinz

  TERMS OF USE: MIT License

  Permission is hereby granted, free of charge, to any person obtaining a copy
  of this software and associated documentation files (the "Software"), to deal
  in the Software without restriction, including without limitation the rights
  to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
  copies of the Software, and to permit persons to whom the Software is
  furnished to do so, subject to the following conditions:

  The above copyright notice and this permission notice shall be included in
  all copies or substantial portions of the Software.

  THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
  IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
  FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
  AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
  LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,ARISING FROM,
  OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
  THE SOFTWARE.
}

CON

  ' default cache dimensions
  DEFAULT_INDEX_WIDTH   = 6
  DEFAULT_OFFSET_WIDTH  = 7

  ' cache line tag flags
  EMPTY_BIT             = 30
  DIRTY_BIT             = 31

PUB image
  return @init_vm

DAT
        org     $0

' initialization structure offsets
' $0: pointer to a two word mailbox
' $4: pointer to where to store the cache lines in hub ram
' $8: number of bits in the cache line index if non-zero (default is DEFAULT_INDEX_WIDTH)
' $a: number of bits in the cache line offset if non-zero (default is DEFAULT_OFFSET_WIDTH)
' note that $4 must be at least 2^(index_width+offset_width) bytes in size
' the cache line mask is returned in $0

init_vm mov     t1, par                 ' get the address of the initialization structure
        rdlong  pvmcmd, t1              ' pvmcmd is a pointer to the virtual address and read/write bit
        mov     pvmaddr, pvmcmd         ' pvmaddr is a pointer into the cache line on return
        add     pvmaddr, #4

        add     t1, #4
        rdlong  cacheptr, t1            ' cacheptr is the base address in hub ram of the cache

        add     t1, #4
        rdlong  t2, t1 wz
  if_nz mov     index_width, t2         ' override the index_width default value

        add     t1, #4
        rdlong  t2, t1 wz
  if_nz mov     offset_width, t2        ' override the offset_width default value

        mov     index_count, #1
        shl     index_count, index_width
        mov     index_mask, index_count
        sub     index_mask, #1

        mov     line_size, #1
        shl     line_size, offset_width
        mov     t1, line_size
        sub     t1, #1
        wrlong  t1, par

        ' put external memory initialization here
        shr     line_size,#1            ' from V1 > offset for byte to word conversion *suggested by JSD
        or      outa,maskP22            ' pin 22 high - LATCH OE - disable
        or      dira,maskP22            ' and now set as an output
        mov     dirb, #$FF              ' latch all high - done uses dirb to set latch-
        call    #done                   ' and set latch, release all pins- EXCEPT P22 for latch OE

        jmp     #vmflush

fillme  long    0[128-fillme]           ' first 128 cog locations are used for a direct mapped cache table

        fit     128

        ' initialize the cache lines
vmflush movd    :flush, #0
        mov     t1, index_count
:flush  mov     0-0, empty_mask
        add     :flush, dstinc
        djnz    t1, #:flush

        ' start the command loop
waitcmd wrlong  zero, pvmcmd
:wait   rdlong  vmline, pvmcmd wz
  if_z  jmp     #:wait

        shr     vmline, offset_width wc ' carry is now one for read and zero for write
        mov     set_dirty_bit, #0       ' make mask to set dirty bit on writes
        muxnc   set_dirty_bit, dirty_mask
        mov     line, vmline            ' get the cache line index
        and     line, index_mask
        mov     hubaddr, line
        shl     hubaddr, offset_width
        add     hubaddr, cacheptr       ' get the address of the cache line
        wrlong  hubaddr, pvmaddr        ' return the address of the cache line
        movs    :ld, line
        movd    :st, line
:ld     mov     vmcurrent, 0-0          ' get the cache line tag
        and     vmcurrent, tag_mask
        cmp     vmcurrent, vmline wz    ' z set means there was a cache hit
  if_nz call    #miss                   ' handle a cache miss
:st     or      0-0, set_dirty_bit      ' set the dirty bit on writes
        jmp     #waitcmd                ' wait for a new command

' line is the cache line index
' vmcurrent is current cache line
' vmline is new cache line
' hubaddr is the address of the cache line
miss    movd    :test, line
        movd    :st, line
:test   test    0-0, dirty_mask wz
  if_z  jmp     #:rd                    ' current cache line is clean, just read new one
        mov     vmaddr, vmcurrent
        shl     vmaddr, offset_width
        call    #wr_cache_line          ' write current cache line
:rd     mov     vmaddr, vmline
        shl     vmaddr, offset_width
        call    #rd_cache_line          ' read new cache line
:st     mov     0-0, vmline
miss_ret ret

' pointers to mailbox entries
pvmcmd          long    0       ' on call this is the virtual address and read/write bit
pvmaddr         long    0       ' on return this is the address of the cache line containing the virtual address

cacheptr        long    0       ' address in hub ram where cache lines are stored
vmline          long    0       ' cache line containing the virtual address
vmcurrent       long    0       ' current selected cache line (same as vmline on a cache hit)
line            long    0       ' current cache line index
set_dirty_bit   long    0       ' DIRTY_BIT set on writes, clear on reads

zero            long    0       ' zero constant
dstinc          long    1<<9    ' increment for the destination field of an instruction
t1              long    0       ' temporary variable
t2              long    0       ' temporary variable

tag_mask        long    !(1<<DIRTY_BIT) ' includes EMPTY_BIT
index_width     long    DEFAULT_INDEX_WIDTH
index_mask      long    0
index_count     long    0
offset_width    long    DEFAULT_OFFSET_WIDTH
line_size       long    0       ' line size in bytes
empty_mask      long    (1<<EMPTY_BIT)
dirty_mask      long    (1<<DIRTY_BIT)

' input parameters to rd_cache_line and wr_cache_line
vmaddr          long    0       ' external address
hubaddr         long    0       ' hub memory address

'------------------------------------------------ SRAM Address Setup -----------------------------------------------------------------
'' looks like address bit 13 not correct? doubt it's hardware...
set161and373
        'mov    ramaddr, vmaddr         ' copy ram address
        shr     vmaddr, #1              ' from old build 1 >
                                        '' schematic connects SRAM A0 to A0, not A1 - jsd
        or      vmaddr, maxram          ' mask off unused ram address bits
'' setup pointer for hub and count
        mov     ptr, hubaddr            ' cant trash hubaddr
        mov     len, line_size          ' or line_size
        mov     dirb, latchvalue        ' save old latch value for restore at end of op.
''
'' do locking here!
''
        or      outa,maskP16P20         ' set control pins high
        or      dira,maskP16P20         ' set control pins P16-P20 as outputs
        mov     latchvalue,#%11111110   ' group 1, displays all off
        call    #set373                 ' send out to the latch
        and     outa,maskP0P20low       ' prepare data pins for address
''extended addressing, only for stacked SRAM - comment out and uncomment below for standard config
        or      dira,maskP0P20P29       ' %00100000_11100000_00000000_00000000 ' make all address pins out
        cmp     vmaddr,maskP19 wc       ' check if we have extended address ' and do extended addressing
        muxnc   outa, maskP29           ' and !mux onto p29
        andn    vmaddr,maskP0P18low     ' mask off the low 19 bits
'' end extended addressing, only for stacked SRAM
        'or     dira,maskP0P20          ' use this for standard ram config
        or      outa,vmaddr             ' send out ramaddr
        andn    outa,maskP20            ' P20 clock low
        or      outa,maskP20            ' P20 clock high
        or      outa,maskP16P20         ' P16-P20 high
        andn    dira,maskP29            ' %1101_1111_1111_1111_1111_1111_1111_1111 ' release P29, xmm only, else comment out
        mov     latchvalue,#%11111101   ' group 2
        call    #set373                 ' change to group 2
set161and373_ret ret                    '' returns @ INS window

'------------------------------------------------ Latch Control -----------------------------------------------------------------
set373  or      outa,maskP22            ' pin 22 high
        or      dira,#%1_11111111       ' enable pins 0-7 and 8 as outputs
        and     outa,maskP0P8low        ' P0-P7 low
        or      outa,latchvalue         ' send out the data
        or      outa,maskP8             ' P8 high, clocks out data
        andn    outa,maskP22            ' pin 22 low
set373_ret ret                          '' returns @ INS window

done    mov     latchvalue, dirb        ' restore old value
        call    #set373                 ' Set latch to vaule prior to cog opperation
        and     dira,maskP0P20low       ' tristates all the common pins, leaves P22 as is though
''
''
'' do un-locking here!
''
done_ret ret

'----------------------------------------------------------------------------------------------------
'
' rd_cache_line - read a cache line from external memory
'
' vmaddr is the external memory address to read
' hubaddr is the hub memory address to write
' line_size is the number of bytes to read
'
'----------------------------------------------------------------------------------------------------
rd_cache_line
pasmramtohub
        call    #set161and373           ' set up the 161 counter and change to group 2
        andn    dira,maskP0P15          ' data bus inputs
        andn    outa,maskP16            ' memory /rd low
        nop                             ' first read sometimes corrupt?
'
ramtohub_loop
        mov     data_16, ina            ' get the data 3
        wrword  data_16, ptr            ' move data to hub 1-2
        andn    outa, maskP20           ' clock 161 low 3
        or      outa, maskP20           ' clock 161 high 4
        add     ptr, #2                 ' increment the hub address 1
        djnz    len,#ramtohub_loop      ' 2
        or      outa,maskP16            ' memory /rd high
        call    #done                   ' tristate pins
rd_cache_line_ret ret

'----------------------------------------------------------------------------------------------------
'
' wr_cache_line - write a cache line to external memory
'
' vmaddr is the external memory address to write
' hubaddr is the hub memory address to read
' line_size is the number of bytes to write
'
'----------------------------------------------------------------------------------------------------
wr_cache_line
pasmhubtoram
        call    #set161and373           ' set up the 161 counter and then change to group 2
        or      dira,maskP0P15          ' data bus outputs
hubtoram_loop
        andn    outa,maskP0P15          ' clear P0 to P15 for output 2
        rdword  data_16,ptr             ' get the word from hub 1-2
        or      outa,data_16            ' send out the byte to P0-P15 3
        andn    outa,maskP17            ' set mem write low 4
        add     ptr, #2                 ' increment by 2 bytes = 1 word. Put this here for small delay while writes 1
        or      outa,maskP17            ' mem write high 2
        andn    outa,maskP20            ' clock 161 low 3
        or      outa,maskP20            ' clock 161 high 4
        djnz    len,#hubtoram_loop      ' loop this many times 1
        call    #done                   ' tristate pins and listen for command
wr_cache_line_ret ret

' constants
maskP0P18low    long    %11111111_11111000_00000000_00000000 ' P0-P18 low
maskP16         long    %00000000_00000001_00000000_00000000 ' pin 16 - SRAM_RD
maskP17         long    %00000000_00000010_00000000_00000000 ' pin 17 - SRAM_WR
maskP19         long    %00000000_00001000_00000000_00000000 ' pin 19 - LOAD - Group1
maskP20         long    %00000000_00010000_00000000_00000000 ' pin 20 - Clock - Group1-Group2
maskP22         long    %00000000_01000000_00000000_00000000 ' pin 22 - Latch OE - GroupPin
maskP0P15       long    %00000000_00000000_11111111_11111111 ' for masking words
maskP16P20      long    %00000000_00011111_00000000_00000000 ' control pins
maskP0P20low    long    %11111111_11100000_00000000_00000000 ' for returning all group pins HiZ
maskP0P8low     long    %11111111_11111111_11111110_00000000 ' P0-P8 low for set 373
maskP8          long    %00000000_00000000_00000001_00000000 ' pin 8 for set 373
maskP0P20P29low long    %11011111_11100000_00000000_00000000 ' xmm
maskP29         long    %00100000_00000000_00000000_00000000 ' xmm
maskP0P20P29    long    %00100000_00011111_11111111_11111111 ' xmm
maxram          long    %00000000_00001111_11111111_11111111 '7_ff_ff - f_ff_ff

latchvalue      res                     ' current 373 value
data_16         res                     ' general purpose value
'ramaddr        res                     ' copy of vmaddr, not used
ptr             res                     ' pointer to hub
len             res                     ' copy of line_size for decimation

        fit     496
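For anyone following the command loop in the listing, here is a rough host-side C model of how a mailbox command gets split into a read/write flag, a tag, a line index and a hub address. This is only a sketch for reasoning about the driver, not part of it: the struct and function names are made up for illustration, and the widths are the defaults from the CON block.

```c
#include <stdint.h>

/* Hypothetical model of the waitcmd decode; names are illustrative only. */
#define INDEX_WIDTH  6u   /* DEFAULT_INDEX_WIDTH  */
#define OFFSET_WIDTH 7u   /* DEFAULT_OFFSET_WIDTH */

typedef struct {
    int      is_read;  /* bit 0 of the command: 1 = read, 0 = write */
    uint32_t vmline;   /* command >> OFFSET_WIDTH, compared with the stored tag */
    uint32_t line;     /* index into the 2^INDEX_WIDTH entry tag table */
    uint32_t hubaddr;  /* hub address of the start of the cache line */
} decoded_cmd;

static decoded_cmd decode_cmd(uint32_t cmd, uint32_t cacheptr)
{
    decoded_cmd d;
    d.is_read = (int)(cmd & 1u);                         /* shr ... wc puts bit 0 in carry */
    d.vmline  = cmd >> OFFSET_WIDTH;                     /* shr vmline, offset_width */
    d.line    = d.vmline & ((1u << INDEX_WIDTH) - 1u);   /* and line, index_mask */
    d.hubaddr = (d.line << OFFSET_WIDTH) + cacheptr;     /* shl hubaddr / add cacheptr */
    return d;
}
```

With the defaults the cache holds 64 lines of 128 bytes, 8192 bytes in total, so two addresses 8192 bytes apart (differing only in bit 13) land on the same line index and differ only in the tag.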
The symptom is that all tests fail miserably. The walking address bit tests always show aliasing on A13, and the other tests flat out fail to complete a write ??? I did change this code to NOT use ramaddr and to use vmaddr directly. The first version of the driver shifted vmaddr right on every set161; now we do a little more, but it should be okay? I know from previous experience that hubaddr and line_size can't be trashed.
If I put these commands back in the PASM engine I'm using, things work perfectly so I'm stumped!
Test 0- Address Walking 0's 15 address bits: ERROR! Expected 0 @ 00007ffc after write to address 00005ffc 00002000
Test 1- Address Walking 1's 15 address bits: ERROR! Expected 0 @ 0 after write to address 00002000 00002000
Test 2- Incremental Pattern Test 32 KB :ERROR at $00000000 Expected $00000001 Received $00001801
Test 3- Pseudo-Random Pattern Test 524 KB : ERROR at $00000000 Expected $d0000001 Received $bb67aaab
Test 4 -Pseudo-Random Pattern Test 32 KB :ERROR at $00000000 Expected $00000e80 Received $a0800e66
Address Walking 0's 18 address bits :ERROR! Expected 0 @ 0003fffc after write to address 0003dffc 00002000
Address Walking 1's 18 address bits. : ERROR! Expected 0 @ 0 after write to address 00002000 00002
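The source of the cache test isn't shown in the thread, so the following is only a minimal sketch of what a walking 1's address test does. It runs against a simulated RAM in which chosen address bits are ignored; the names `phys`, `poke`, `peek` and `walking_ones` and the fault model are all assumptions for illustration. Note that 00007ffc and 00005ffc in the output differ by exactly 00002000, which is bit 13.

```c
#include <stdint.h>
#include <string.h>

#define ADDR_BITS 15u
#define MEM_BYTES (1u << ADDR_BITS)

static uint8_t  ram[MEM_BYTES];
static uint32_t ignored_bits;   /* address bits the simulated hardware drops */

/* Map a requested address to the cell the faulty hardware would really use. */
static uint32_t phys(uint32_t addr) { return (addr & ~ignored_bits) % MEM_BYTES; }
static void     poke(uint32_t addr, uint8_t v) { ram[phys(addr)] = v; }
static uint8_t  peek(uint32_t addr) { return ram[phys(addr)]; }

/* Walk a single 1 through the address bits. Returns 0 if address 0 is never
   disturbed, otherwise the first address whose write showed up at address 0. */
static uint32_t walking_ones(uint32_t fault_mask)
{
    ignored_bits = fault_mask;
    memset(ram, 0, sizeof ram);
    for (uint32_t bit = 1u; bit < MEM_BYTES; bit <<= 1) {
        poke(0u, 0x00);
        poke(bit, 0xFF);
        if (peek(0u) != 0x00)
            return bit;         /* a write to `bit` aliased onto address 0 */
    }
    return 0u;
}
```

Calling `walking_ones(0x2000u)` in this model reports 0x2000, the same shape as the "Expected 0 @ 0 after write to address 00002000" lines.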
Comments
*edit*
Looks like that was it! I'm testing full memory area now!
Thank you SO MUCH for taking a look, I guess I was starting to get tunnel vision.
It looks like that resolved my issue. The next step is to build the cache driver and see if it runs programs. It looks like I hit issues trying to use the full 1 megaword, though: it passes the first 2 tests, then fails the rest.
*edit* Okay, now I'm having the problems that got me last time... I can load some programs in SimpleIDE and they run just fine in XMMC and XMM (hello.c), but dhrystone won't run now!
The other question that I've been trying to figure out... When running the Walking address bit test, for example, it says testing 18 address bits *524k* but with the right shift of line_size by one and the right shift of hubaddr, is that actually 17 bits? I should probably get my logic analyzer out and check..
thanks again @kuroneko, I looked at that stupid OR inst for over an hour and it didn't click!
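The fix itself isn't quoted in the thread. Assuming the OR in question is `or vmaddr, maxram` in set161and373, which is commented as "mask off unused ram address bits", the difference between the two operations looks like this in C. The helper names are made up; `MAXRAM` is the same value as `maxram` in the listing.

```c
#include <stdint.h>

#define MAXRAM 0x000FFFFFu   /* %00000000_00001111_11111111_11111111 */

/* What the listing does: OR sets every bit that is set in the mask. */
static uint32_t addr_with_or(uint32_t vmaddr)  { return (vmaddr >> 1) | MAXRAM; }

/* What the comment describes: AND keeps only the bits inside the mask. */
static uint32_t addr_with_and(uint32_t vmaddr) { return (vmaddr >> 1) & MAXRAM; }
```

Under that assumption, OR turns every address below the mask limit into $FFFFF, so every cache line would be read from and written back to the same SRAM location, while AND leaves the shifted address intact.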
I have a feeling it's something simple. I'll try to pull out the LA over the next few days and see if I can see anything obvious. It should benchmark quite well since read/write loops are 2 hub cycles!
Thanks again for all your help!
*edit*
Here's a full list of things that work:
Cache test, mostly. Throws errors when memory size increased, else okay.
FIBO: seems to run fine
Hello : seems okay
Dhrystone: loader opens terminal and nothing happens.
xmm-single and xmmc were tested on all programs and it doesn't seem to make a difference. If there are other programs I should check that could help pinpoint the problem, I will. I know this version of dry.c is good because I ran it on the previous version Steve helped with.
Line 130 of the dry.c file:
I have 19 address bits, and these test okay. My extended memory hack has 20 and works in the cog driver, but not in the cache test? I have this feeling that with both the shifts, it works out one bit higher than the test results say? So when I run walking 0s with 18 address bits, it's running from A0-A18, which is really 19 address bits, and when I run incremental 1024 KB it is really 1024 kWord ??
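The bit counting can be checked with a little arithmetic. This sketch assumes the test's addresses are byte addresses (as values like 0003fffc suggest) and that the SRAM is 16 bits wide, as the P0-P15 data bus in the driver implies; the function names are hypothetical. It does not settle what the test program itself calls "18 address bits".

```c
#include <stdint.h>

/* Bytes reachable with `lines` address lines on a bus `width_bits` wide. */
static uint64_t capacity_bytes(unsigned lines, unsigned width_bits)
{
    return ((uint64_t)1 << lines) * (width_bits / 8u);
}

/* Word address the SRAM sees for a byte address (the shr vmaddr, #1 step). */
static uint32_t word_addr(uint32_t byte_addr) { return byte_addr >> 1; }

/* Address lines needed to select `count` locations. */
static unsigned lines_for(uint64_t count)
{
    unsigned n = 0;
    while (((uint64_t)1 << n) < count)
        n++;
    return n;
}
```

Under those assumptions, 19 address lines of 16-bit words give 1,048,576 bytes (512 kWord), 20 lines give 2,097,152 bytes (1 MWord), and an 18-bit byte address range only exercises 17 SRAM address lines after the shift.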
So I have one version compiled. I'm going to import a few optimizations from cogdriver *using safe drivers, not fast drivers*
last result: and with stacked ramchip config *got it working*
I'm getting excited because with the fast version, only need to toggle 1 pin, not 2. SO, read and write could fit in 1 hub window using the counters! I'm thinking this could be a VERY fast cache
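As a back-of-envelope check on what one hub window per word would buy, here is a small C sketch of the peak transfer rate of the inner loop. On the Propeller 1 each cog gets a hub access slot every 16 system clocks; the 80 MHz clock used in the example figures is an assumption (the thread doesn't state the board's clock), and the function name is made up. Setup overhead in set161and373 and done is ignored.

```c
#include <stdint.h>

#define CLOCKS_PER_HUB_WINDOW 16u   /* Propeller 1: one hub slot per cog every 16 clocks */

/* Peak bytes per second when each loop pass moves one 16-bit word and
   takes `windows_per_word` hub windows. */
static uint32_t xfer_bytes_per_sec(uint32_t sysclk_hz, uint32_t windows_per_word)
{
    return sysclk_hz / (CLOCKS_PER_HUB_WINDOW * windows_per_word) * 2u;
}
```

At an assumed 80 MHz this gives 5,000,000 bytes/s for the current two-window loops and 10,000,000 bytes/s if a word could be moved in a single window.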
preliminary results!
Already beats SDRAM! I'm going to try again and compile for "standard" hardware, which should be a titch faster! It's strange how xmmc and xmm-single differ slightly as to which gives the faster dhrystone; the fibos are the same, though. I'd really like to push this to use faster writes. I think it would work, I just need to toggle 2 pins at the same time with counters....
The fibo function is so small it probably fits in the cache so it doesn't end up accessing external memory much and it doesn't use any global variables.
I really think Dhrystone should be compiled with -Os. The quote Steve provided said that optimizing compilers should be prevented from removing significant statements. That's not at all the same as saying optimization should be turned off, and in fact if you look on the web I believe most Dhrystone benchmarks are quoted with optimization turned on. Leaving it off might give people an inaccurate picture of Propeller performance.
But I guess this is drifting off topic...
Eric
And the PASM driver continues. Then, the Touch.h looks something like this:
Then there's the Touch.c file, which looks something like this:
Now I'm wondering if I'm on the right path, or are there some obvious glaring errors?
Last I heard Demoniator(TED) was AFK, so if anyone has used LOCKS before... I'm wondering if there's a command to call to enable the lock? I believe I have them implemented correctly in the cache driver, but still not tested.
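On the Propeller the hardware lock operations are LOCKNEW, LOCKSET, LOCKCLR and LOCKRET, and LOCKSET reports whether the lock was already set. The sketch below is a host-side analogue of that test-and-set pattern using C11 atomics, so it can be tried on a PC. It is not the PropGCC API, and the `bus_*` names are invented for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One shared flag standing in for a hardware lock bit. */
static atomic_flag bus_lock = ATOMIC_FLAG_INIT;

/* Like LOCKSET: set the lock and report whether it was already set. */
static bool bus_lockset(void) { return atomic_flag_test_and_set(&bus_lock); }

/* Like LOCKCLR: clear the lock. */
static void bus_lockclr(void) { atomic_flag_clear(&bus_lock); }

/* Non-blocking attempt: true if we now own the bus. */
static bool bus_try_acquire(void) { return !bus_lockset(); }

/* Blocking acquire: spin until the previous state was "clear". */
static void bus_acquire(void) { while (bus_lockset()) { /* spin */ } }

static void bus_release(void) { bus_lockclr(); }
```

In the driver, the acquire would go where the listing says "do locking here!" and the release where it says "do un-locking here!", with every other cog that touches the shared pins using the same lock.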
Also, about cache_interface.spin... If I were to modify this to handle some functions... Can I do this? Or is it best to just leave the cache driver for just cache???
*and my 2 cents about optimizations... Has anyone actually TRIED optimized dhrystone, and if so, how different are the numbers?
I'm currently just using dhrystone as a "known-good" test, since it seems to find problems that the cache test doesn't. It is interesting to see how code modifications alter the results. The next board I build will have an auto-increment-address function, which should put read-writes at 1 hub cycle per loop... That and the 6.25 MHz xtal should provide some interesting results, even if they are "just synthetic benchmarks" :P
Thanks again guys!