Okay, I'm with you guys so far. The part that I need to figure out is using word-aligned data from the SRAMs when it seems the cache expects byte-aligned data. I could just use the lower 8 bits, I suppose, but then I waste half the memory available! I'm sure it's not that hard; I just need to put it all together. Also, is there a limit to the size of the cache file?
*edit*
I see line_size controls the number of bytes to read and write. Is there a way to make sure this is an even number? That would make things a bit easier, since I could just divide it by 2 to get the number of transfers from SRAM. Then, when doing the transfer: read a word from RAM and put 2 bytes in hub, or read 2 bytes from hub and put a word in RAM. Does this sound like I'm heading in the right direction?
The cache driver will never read or write in increments smaller than a cache line. You get to decide the size of a cache line but it will certainly be larger than 16 bits and it will be aligned on a cache-line sized boundary. In other words, if you use 64 byte cache lines they will be aligned on a 64 byte boundary which is, of course, also 16 bit aligned. I don't think alignment will be a problem.
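To make the alignment argument concrete, here is a minimal host-side sketch in Python (not Propeller code). The helper name `line_base` and the sample addresses are made up for illustration; the only assumption, taken from the post above, is that the cache line size is a power of two.

```python
def line_base(addr: int, line_size: int) -> int:
    """Start address of the cache line that contains addr.

    line_size must be a power of two, so masking off the low bits
    rounds addr down to a line boundary.
    """
    return addr & ~(line_size - 1)


# Every 64-byte-aligned base address is even, so it is also 16-bit aligned.
for addr in (0x0000, 0x003F, 0x0040, 0x12345):
    base = line_base(addr, 64)
    assert base % 64 == 0
    assert base % 2 == 0
```

Because every line starts on an even address and holds an even number of bytes, a 16-bit-wide memory never has to deliver half a word.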
I think doc has the design pretty much nailed down. There is a 2-screen variant I'm eagerly waiting to get my hands on too! Should function similar at the low level.
I'm not sure what you mean "available," Dr. Acula and I have boards built and running. If you're asking about "distribution" spec boards, you'd have to check with Doc. I know we are "close" to release and sure we could get boards to those interested in the future. I'm not sure how soon exactly.
These use the 40-pin touchscreen boards and I believe we will be supporting the SSD1289 controller. These are the displays I'm using. Doc is using an ILI9325 right now but it sounds like he will be converting to the SSD soon. I'm not sure if you guys have any of these displays or similar floating around.
I've been looking at the skeleton cache file and I think I understand it *sort of*. I'm sure I will have questions later.
Thanks again for all the help!
By the way, I asked about hardware available to determine if it might be possible for me to help you with the cache driver. Obviously, I'd need hardware to do that.
Good, I just wanted to make sure about the line size. Like I said before I'm VERY dense with moments of idiot savant. I'm working on things and will let you guys know when I'm getting close. BTW, if anyone is interested in obtaining a board, just contact me.
*edit*
So far, it's mostly a "copy-paste-modify" job. It should be getting close though.
This is the overhead for the driver:
'get_values rdlong hubaddr, hubptr ' get hub address
' rdlong ramaddr, ramptr ' get ram address
' rdlong len, lenptr ' get length
' mov err, #5 ' err=5
'get_values_ret ret
'init 'mov err, #0 ' reset err=false=good
'mov dira,zero ' tristate the pins with the cog dira
' and dira,maskP0P20P22 ' tristates all the common pins
'done 'wrlong err, errptr ' status =0=false=good, else error x
'wrlong zero, comptr ' command =0 (done)
' Pass pasm_n = 0- 7 come to this with P0-P20 and P22 tristated and returns them as this too
set137 or dira,maskP22 ' pin 22 is an output
andn outa,maskP22 ' set P22low so Y0-Y7 are all high
or dira,maskP0P20 ' pins P0-P20 are outputs
and outa,maskP0P2low ' set these 3 pins low
or outa,pasm_n ' set the 137 pins
or outa,maskP22 ' pin 22 high
set137_ret ret ' return
load161pasm ' uses vmaddr
or outa,maskP0P20 ' set P0-P20 high
or dira,maskP0P20 ' output pins 0-20
mov pasm_n,#0 ' group 0
call #set137 ' set the 137 output
and outa,maskP0P18low ' pins 0-18 set low
or outa,vmaddr ' output address to 161 chips
or outa,maskP19 ' clock high
or outa,maskP20 ' load high
andn outa,maskP19 ' clock low
andn outa,maskP20 ' load low
or outa,maskP19 ' clock high
or outa,maskP20 ' load high
load161pasm_ret ret
stop jmp #stop ' for debugging
memorytransfer or dira,maskP16P20 ' so /wr and other pins definitely high
or outa,maskP16P20
mov pasm_n,#1 ' back to group 1 for memory transfer
call #set137 ' as next routine will always be group 1
or dira,maskP16P20 ' output pins 16-20
or outa,maskP16P20 ' set P16-P20 high (P0-P15 set as inputs or outputs in the calling routine)
memorytransfer_ret ret
busoutput or dira,maskP0P15 ' set prop pins 0-15 as outputs
busoutput_ret ret
businput and dira,maskP16P31 ' set P0-P15 as inputs
businput_ret ret
delaynop nop
nop
nop
nop
delaynop_ret ret
And the actual reads:
'----------------------------------------------------------------------------------------------------
'
' rd_cache_line - read a cache line from external memory
'
' vmaddr is the external memory address to read
' hubaddr is the hub memory address to write
' line_size is the number of bytes to read
'
'----------------------------------------------------------------------------------------------------
rd_cache_line
' command T
pasmramtohub
call #load161pasm ' load the 161 counters with ramaddr
call #memorytransfer ' set to group 1, enable P16-P20 as outputs and set P16-P20 high
call #businput ' set prop pins P0-P15 as inputs
andn outa,maskP16 ' memory /rd low
ramtohub_loop mov data_16,ina ' get the data
wrword data_16,hubaddr ' move data to hub
andn outa,maskP19 ' clock 161 low
or outa,maskP19 ' clock 161 high
add hubaddr,#2 ' increment the hub address
djnz line_size,#ramtohub_loop
or outa,maskP16 ' memory /rd high
or dira,maskP0P15 ' %00000000_00000000_11111111_11111111 restore P0-P15 as outputs
and dira,maskP0P20P22 ' tristates all the common pins
rd_cache_line_ret
ret
The writes:
'----------------------------------------------------------------------------------------------------
'
' wr_cache_line - write a cache line to external memory
'
' vmaddr is the external memory address to write
' hubaddr is the hub memory address to read
' line_size is the number of bytes to write
'
'----------------------------------------------------------------------------------------------------
wr_cache_line
' command S
pasmhubtoram
call #load161pasm ' load the 161 counters with ramaddr
call #memorytransfer ' set to group 1, enable P16-P20 as outputs and set P16-P20 high
call #busoutput ' set prop pins P0-P15 as outputs
hubtoram_loop and outa,maskP16P31 '%11111111_11111111_00000000_00000000 ' clear for output
rdword data_16,hubaddr ' get the word from hub
and data_16,maskP0P15 ' mask to a word only
or outa,data_16 ' send out the word to P0-P15
andn outa,maskP17 ' set mem write low
add hubaddr,#2 ' increment by 2 bytes = 1 word. Put this here for small delay while writes
or outa,maskP17 ' mem write high
andn outa,maskP19 ' clock 161 low
or outa,maskP19 ' clock 161 high
djnz line_size,#hubtoram_loop ' loop this many times
and dira,maskP0P20P22 ' tristates all the common pins
wr_cache_line_ret
ret
Last is the variable overhead:
pasm_n long 0 ' general purpose value
data_16 long 0 ' general purpose value
maskP0P2low long %11111111_11111111_11111111_11111000 ' P0-P2 low
maskP0P20 long %00000000_00011111_11111111_11111111 ' P0-P18 enabled for output plus P19,P20
maskP0P18low long %11111111_11111000_00000000_00000000 ' P0-P18 low
maskP16 long %00000000_00000001_00000000_00000000 ' pin 16
maskP17 long %00000000_00000010_00000000_00000000 ' pin 17
maskP18 long %00000000_00000100_00000000_00000000 ' pin 18
maskP19 long %00000000_00001000_00000000_00000000 ' pin 19
maskP20 long %00000000_00010000_00000000_00000000 ' pin 20
maskP22 long %00000000_01000000_00000000_00000000 ' pin 22
maskP16P31 long %11111111_11111111_00000000_00000000 ' pin 16 to pin 31
maskP0P15 long %00000000_00000000_11111111_11111111 ' for masking words
maskP16P20 long %00000000_00011111_00000000_00000000
maskP0P20P22 long %11111111_10100000_00000000_00000000 ' for returning all group pins HiZ
I assume I don't need the "get_value" since that's already taken care of? Still need to work out the byte-word alignment problem but...
I know there's still the word-byte alignment problem but I built the previously described code.
The result from BSTC is:
c:\Users\Joe\Desktop\Propeller\GUI\Touch161>bstc
Brads Spin Tool Compiler v0.15.3 - Copyright 2008
Compiled for i386 Win32 at 08:17:48 on 2009/07/20
Loading Object skeleton_cache
Program size is 1092 longs
Compiled 181 Lines of Code in 0.006 Seconds
Any thoughts on how to fix the byte-word alignment issue? Do 2 transfers per loop and shift "line_size" right by 1 at the start?
Okay, I will work on that. Should I be able to test the cache driver even if it's wasting half the memory? I'm at post 2: I compiled the cache driver and changed the board config. Now I'm stuck at running cache_test with the interface.
*edit* ok RTFM and I think I got it... I'll be back to you with the results:
Hang on. Maybe I misunderstood what you meant. Why do you need to waste half of the memory? Just fetch 16 bits at a time and store them into hub memory using wrword, increment the address by 2 and decrement the count by 2. You can assume that the cache driver will never ask for an odd number of bytes so you don't have to take care of that case.
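The transfer described above can be modelled on a PC before touching the PASM. This is only a sketch in Python, not driver code: the names `read_line` and `write_line` are hypothetical, and it assumes that SRAM cell n holds bytes 2n and 2n+1 of the external address space, which the thread does not confirm. The byte order follows what wrword and rdword do in hub memory (low byte at the lower address).

```python
def read_line(sram_words, byte_addr, hub, hub_addr, line_bytes):
    """Model of a cache-line read: 16-bit SRAM cells -> byte-addressed hub."""
    assert line_bytes % 2 == 0        # the cache driver never asks for an odd count
    word_addr = byte_addr >> 1        # assumption: cell n holds bytes 2n and 2n+1
    remaining = line_bytes            # count down a copy, not the stored line size
    while remaining:
        word = sram_words[word_addr] & 0xFFFF
        hub[hub_addr] = word & 0xFF   # low byte goes to the lower hub address
        hub[hub_addr + 1] = word >> 8
        word_addr += 1                # one counter clock per word
        hub_addr += 2                 # two hub bytes per word
        remaining -= 2


def write_line(sram_words, byte_addr, hub, hub_addr, line_bytes):
    """Model of a cache-line write: byte-addressed hub -> 16-bit SRAM cells."""
    assert line_bytes % 2 == 0
    word_addr = byte_addr >> 1
    remaining = line_bytes
    while remaining:
        sram_words[word_addr] = hub[hub_addr] | (hub[hub_addr + 1] << 8)
        word_addr += 1
        hub_addr += 2
        remaining -= 2


# Round trip: write 8 hub bytes out, read them back into a second hub area.
sram = [0] * 64
hub = bytearray(range(1, 9)) + bytearray(8)
write_line(sram, 0, hub, 0, 8)
read_line(sram, 0, hub, 8, 8)
assert hub[8:16] == hub[0:8]
```

Note that both routines count down a local copy of the length. In a PASM loop built on djnz, the register being decremented ends at zero, so the stored line size has to be copied or rebuilt before each transfer.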
@averagejoe, I don't believe you need to worry about caching using the sram. The clever boffins here have managed to write a generic cache driver that uses the SD card and hub ram. That works on a wide variety of boards - basically any board with an SD card. Just change the pin numbers. It is fast because most of the time it is running from hub, and it does not 'wear out' the SD card as all the SD card accesses are reads, not writes.
So if you take the standard XMM code for the Board of Education and just change the SD pin numbers you can run huge C programs on our board.
That simplifies things because then all the SRAM accesses are done from C code. And all the PASM part does not need to change. So all you really have to do is rewrite the PUB docmd() routine to run from C, as well as change the cog start and stop methods to the C syntax.
GCC never needs to know about the SRAM and that means that GCC does not need changing.
A few weeks back I thought a lot about using the SRAM as the cache for GCC but I really don't think we need it any more. And that frees up the entire SRAM for bitmaps and fonts and large text buffers - all of which will be handled by a C program (which will probably ultimately become a .h program). There is a Spin to C translator out there, so it may even be possible to feed routines into this and porting over the spin code we have should be even easier. The pasm part stays as it is, and the spin gets converted to C.
While you are correct that you can use the SD cache driver to run large C programs from an SD card, the performance will not be nearly as good as you could get running from your parallel SRAM if you can figure out a way to share the SRAM between the display driver and the cache driver.
I see an option for split; if it is what I assume, I think it would be the best candidate. If you can split the caching between SRAM and SD, is there a way to decide what goes where? I will try porting the "working" touch code to C and test with just SD caching. I think eventually it would be good to have the SRAM available, but it might not be top priority right now!
Yes it would be interesting to do the experiment. I was basing my comments on experiments we did last year with Catalina where we found in the end that caching was so efficient that the actual speed of transfers to and from the cache hardly seemed to affect the overall performance.
But, hypothetically, if one did use the sram, there are two simple routines to transfer data to and from hub. Pass to a pasm routine a single letter command to send or receive, a long for hub address, a long for ram address and number of bytes.
And I guess you need a flag somewhere in hub to say the code is using the sram or the cache is using the sram.
I don't think it will be needed though. XMMC from hub is fast, and it could be as fast as LMM when functions are small enough to fit in hub. Whenever something runs a bit slow, I find it is easier to port things over to pasm.
Is there a generic big C program we can use to test XMMC? A text based adventure program or something?
The only problem with XMMC comes when you use up all of HUB RAM - it is possible to run with a smaller cache of course, but performance does drop.
With the right device, XMMC is faster than SPIN; with the wrong device (SDcard) it is slower than SPIN. XMMC on EEPROM is faster than XMMC on SDcard.
XMM-SINGLE lets you use all of SRAM for code and data; it is usually slower than XMMC.
One of the bigger programs I've used for testing is David's EBASIC2. It works great and can quickly highlight performance differences. EBASIC2 is in the Propeller-GCC demos package.
One of the biggest challenges with XMMC on SD is the large cache line. SD accesses are done in 512-byte chunks, which allows for only 16 cache lines in an 8K cache. It would be interesting to try some experiments with a 16K cache to see how much improvement we would get. There have also been some ideas about reducing the cache line size to 256 bytes by throwing away half the data on each access, or providing a small secondary cache that would hold the unused 256 bytes in case they are needed later.
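The line counts behind those options are simple arithmetic. A small Python sketch (the function name `line_count` is just for illustration):

```python
def line_count(cache_bytes: int, line_bytes: int) -> int:
    """Number of cache lines that fit in a cache of the given size."""
    return cache_bytes // line_bytes


# The three configurations mentioned above: today's 8K cache with 512-byte
# lines, a 16K cache, and 8K with the line size cut to 256 bytes.
for cache_kb, line_bytes in ((8, 512), (16, 512), (8, 256)):
    lines = line_count(cache_kb * 1024, line_bytes)
    print(f"{cache_kb}K cache, {line_bytes}-byte lines: {lines} lines")
```

Doubling the cache and halving the line size both give 32 lines; the difference is that the second option costs no extra hub RAM but discards half of every SD read unless a secondary buffer keeps it.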
Hey guys, I've been working on the cache driver and I think I'm close. What I'm trying to figure out now is the best way to finish the byte-word alignment issue. It seems the code was pretty close already. The only remaining issue *I THINK* is the best way to change "line_size". My initial thought was to shift line_size right by 1 somewhere at the beginning of the call. My other thought was to subtract 1 before the "djnz line_size,#ramtohub_loop" or "djnz line_size,#hubtoram_loop", but this would take an extra instruction every time the main loop runs. Any advice on where to put the "shr line_size, #1"? Maybe in "Load161"?
Here's what I have so far.
'get_values rdlong hubaddr, hubptr ' get hub address
' rdlong ramaddr, ramptr ' get ram address
' rdlong len, lenptr ' get length
' mov err, #5 ' err=5
'get_values_ret ret
'init 'mov err, #0 ' reset err=false=good
'mov dira,zero ' tristate the pins with the cog dira
' and dira,maskP0P20P22 ' tristates all the common pins
'done 'wrlong err, errptr ' status =0=false=good, else error x
'wrlong zero, comptr ' command =0 (done)
' Pass pasm_n = 0- 7 come to this with P0-P20 and P22 tristated and returns them as this too
set137 or dira,maskP22 ' pin 22 is an output
andn outa,maskP22 ' set P22low so Y0-Y7 are all high
or dira,maskP0P20 ' pins P0-P20 are outputs
and outa,maskP0P2low ' set these 3 pins low
or outa,pasm_n ' set the 137 pins
or outa,maskP22 ' pin 22 high
set137_ret ret ' return
load161pasm ' uses vmaddr
or outa,maskP0P20 ' set P0-P20 high
or dira,maskP0P20 ' output pins 0-20
mov pasm_n,#0 ' group 0
call #set137 ' set the 137 output
and outa,maskP0P18low ' pins 0-18 set low
or outa,vmaddr ' output address to 161 chips
or outa,maskP19 ' clock high
or outa,maskP20 ' load high
andn outa,maskP19 ' clock low
andn outa,maskP20 ' load low
or outa,maskP19 ' clock high
or outa,maskP20 ' load high
load161pasm_ret ret
stop jmp #stop ' for debugging
memorytransfer or dira,maskP16P20 ' so /wr and other pins definitely high
or outa,maskP16P20
mov pasm_n,#1 ' back to group 1 for memory transfer
call #set137 ' as next routine will always be group 1
or dira,maskP16P20 ' output pins 16-20
or outa,maskP16P20 ' set P16-P20 high (P0-P15 set as inputs or outputs in the calling routine)
memorytransfer_ret ret
busoutput or dira,maskP0P15 ' set prop pins 0-15 as outputs
busoutput_ret ret
businput and dira,maskP16P31 ' set P0-P15 as inputs
businput_ret ret
delaynop nop
nop
nop
nop
delaynop_ret ret
'----------------------------------------------------------------------------------------------------
'
' rd_cache_line - read a cache line from external memory
'
' vmaddr is the external memory address to read
' hubaddr is the hub memory address to write
' line_size is the number of bytes to read
'
'----------------------------------------------------------------------------------------------------
rd_cache_line
' command T
pasmramtohub
call #load161pasm ' load the 161 counters with ramaddr
call #memorytransfer ' set to group 1, enable P16-P20 as outputs and set P16-P20 high
call #businput ' set prop pins P0-P15 as inputs
andn outa,maskP16 ' memory /rd low
ramtohub_loop mov data_16,ina ' get the data
wrword data_16,hubaddr ' move data to hub
andn outa,maskP19 ' clock 161 low
or outa,maskP19 ' clock 161 high
add hubaddr,#2 ' increment the hub address
djnz line_size,#ramtohub_loop
or outa,maskP16 ' memory /rd high
or dira,maskP0P15 ' %00000000_00000000_11111111_11111111 restore P0-P15 as outputs
and dira,maskP0P20P22 ' tristates all the common pins
rd_cache_line_ret
ret
'----------------------------------------------------------------------------------------------------
'
' wr_cache_line - write a cache line to external memory
'
' vmaddr is the external memory address to write
' hubaddr is the hub memory address to read
' line_size is the number of bytes to write
'
'----------------------------------------------------------------------------------------------------
wr_cache_line
' command S
pasmhubtoram
call #load161pasm ' load the 161 counters with ramaddr
call #memorytransfer ' set to group 1, enable P16-P20 as outputs and set P16-P20 high
call #busoutput ' set prop pins P0-P15 as outputs
hubtoram_loop and outa,maskP16P31 '%11111111_11111111_00000000_00000000 ' clear for output
rdword data_16,hubaddr ' get the word from hub
and data_16,maskP0P15 ' mask to a word only
or outa,data_16 ' send out the word to P0-P15
andn outa,maskP17 ' set mem write low
add hubaddr,#2 ' increment by 2 bytes = 1 word. Put this here for small delay while writes
or outa,maskP17 ' mem write high
andn outa,maskP19 ' clock 161 low
or outa,maskP19 ' clock 161 high
djnz line_size,#hubtoram_loop ' loop this many times
and dira,maskP0P20P22 ' tristates all the common pins
wr_cache_line_ret
ret
pasm_n long 0 ' general purpose value
data_16 long 0 ' general purpose value
maskP0P2low long %11111111_11111111_11111111_11111000 ' P0-P2 low
maskP0P20 long %00000000_00011111_11111111_11111111 ' P0-P18 enabled for output plus P19,P20
maskP0P18low long %11111111_11111000_00000000_00000000 ' P0-P18 low
maskP16 long %00000000_00000001_00000000_00000000 ' pin 16
maskP17 long %00000000_00000010_00000000_00000000 ' pin 17
maskP18 long %00000000_00000100_00000000_00000000 ' pin 18
maskP19 long %00000000_00001000_00000000_00000000 ' pin 19
maskP20 long %00000000_00010000_00000000_00000000 ' pin 20
maskP22 long %00000000_01000000_00000000_00000000 ' pin 22
maskP16P31 long %11111111_11111111_00000000_00000000 ' pin 16 to pin 31
maskP0P15 long %00000000_00000000_11111111_11111111 ' for masking words
maskP16P20 long %00000000_00011111_00000000_00000000
maskP0P20P22 long %11111111_10100000_00000000_00000000 ' for returning all group pins HiZ
fit 496
Please advise the best place to put the "overhead" such as Load161. I'm also not sure of what tests to run in test_cache once the cache file is built. Thanks as always!
*edit*
I tried "T a" and get a bunch of nonsense from the terminal?
{
Skeleton JCACHE external RAM driver
Copyright (c) 2011 by David Betz
Based on code by Steve Denson (jazzed)
Copyright (c) 2010 by John Steven Denson
Inspired by VMCOG - virtual memory server for the Propeller
Copyright (c) February 3, 2010 by William Henning
For the EuroTouch 161 By James Moxham and Joe Heinz
TERMS OF USE: MIT License
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
}
CON
' default cache dimensions
DEFAULT_INDEX_WIDTH = 6
DEFAULT_OFFSET_WIDTH = 7
' cache line tag flags
EMPTY_BIT = 30
DIRTY_BIT = 31
PUB image
return @init_vm
DAT
org $0
' initialization structure offsets
' $0: pointer to a two word mailbox
' $4: pointer to where to store the cache lines in hub ram
' $8: number of bits in the cache line index if non-zero (default is DEFAULT_INDEX_WIDTH)
' $a: number of bits in the cache line offset if non-zero (default is DEFAULT_OFFSET_WIDTH)
' note that $4 must be at least 2^(index_width+offset_width) bytes in size
' the cache line mask is returned in $0
init_vm mov t1, par ' get the address of the initialization structure
rdlong pvmcmd, t1 ' pvmcmd is a pointer to the virtual address and read/write bit
mov pvmaddr, pvmcmd ' pvmaddr is a pointer into the cache line on return
add pvmaddr, #4
add t1, #4
rdlong cacheptr, t1 ' cacheptr is the base address in hub ram of the cache
add t1, #4
rdlong t2, t1 wz
if_nz mov index_width, t2 ' override the index_width default value
add t1, #4
rdlong t2, t1 wz
if_nz mov offset_width, t2 ' override the offset_width default value
mov index_count, #1
shl index_count, index_width
mov index_mask, index_count
sub index_mask, #1
mov line_size, #1
shl line_size, offset_width
mov t1, line_size
sub t1, #1
wrlong t1, par
' put external memory initialization here
jmp #vmflush
fillme long 0[128-fillme] ' first 128 cog locations are used for a direct mapped cache table
fit 128
' initialize the cache lines
vmflush movd :flush, #0
mov t1, index_count
:flush mov 0-0, empty_mask
add :flush, dstinc
djnz t1, #:flush
' start the command loop
waitcmd wrlong zero, pvmcmd
:wait rdlong vmline, pvmcmd wz
if_z jmp #:wait
shr vmline, offset_width wc ' carry is now one for read and zero for write
mov set_dirty_bit, #0 ' make mask to set dirty bit on writes
muxnc set_dirty_bit, dirty_mask
mov line, vmline ' get the cache line index
and line, index_mask
mov hubaddr, line
shl hubaddr, offset_width
add hubaddr, cacheptr ' get the address of the cache line
wrlong hubaddr, pvmaddr ' return the address of the cache line
movs :ld, line
movd :st, line
:ld mov vmcurrent, 0-0 ' get the cache line tag
and vmcurrent, tag_mask
cmp vmcurrent, vmline wz ' z set means there was a cache hit
if_nz call #miss ' handle a cache miss
:st or 0-0, set_dirty_bit ' set the dirty bit on writes
jmp #waitcmd ' wait for a new command
' line is the cache line index
' vmcurrent is current cache line
' vmline is new cache line
' hubaddr is the address of the cache line
miss movd :test, line
movd :st, line
:test test 0-0, dirty_mask wz
if_z jmp #:rd ' current cache line is clean, just read new one
mov vmaddr, vmcurrent
shl vmaddr, offset_width
call #wr_cache_line ' write current cache line
:rd mov vmaddr, vmline
shl vmaddr, offset_width
call #rd_cache_line ' read new cache line
:st mov 0-0, vmline
miss_ret ret
' pointers to mailbox entries
pvmcmd long 0 ' on call this is the virtual address and read/write bit
pvmaddr long 0 ' on return this is the address of the cache line containing the virtual address
cacheptr long 0 ' address in hub ram where cache lines are stored
vmline long 0 ' cache line containing the virtual address
vmcurrent long 0 ' current selected cache line (same as vmline on a cache hit)
line long 0 ' current cache line index
set_dirty_bit long 0 ' DIRTY_BIT set on writes, clear on reads
zero long 0 ' zero constant
dstinc long 1<<9 ' increment for the destination field of an instruction
t1 long 0 ' temporary variable
t2 long 0 ' temporary variable
tag_mask long !(1<<DIRTY_BIT) ' includes EMPTY_BIT
index_width long DEFAULT_INDEX_WIDTH
index_mask long 0
index_count long 0
offset_width long DEFAULT_OFFSET_WIDTH
line_size long 0 ' line size in bytes
empty_mask long (1<<EMPTY_BIT)
dirty_mask long (1<<DIRTY_BIT)
' input parameters to rd_cache_line and wr_cache_line
vmaddr long 0 ' external address
hubaddr long 0 ' hub memory address
'get_values rdlong hubaddr, hubptr ' get hub address
' rdlong ramaddr, ramptr ' get ram address
' rdlong len, lenptr ' get length
' mov err, #5 ' err=5
'get_values_ret ret
'init 'mov err, #0 ' reset err=false=good
'mov dira,zero ' tristate the pins with the cog dira
' and dira,maskP0P20P22 ' tristates all the common pins
'done 'wrlong err, errptr ' status =0=false=good, else error x
'wrlong zero, comptr ' command =0 (done)
' Pass pasm_n = 0- 7 come to this with P0-P20 and P22 tristated and returns them as this too
set137 or dira,maskP22 ' pin 22 is an output
andn outa,maskP22 ' set P22low so Y0-Y7 are all high
or dira,maskP0P20 ' pins P0-P20 are outputs
and outa,maskP0P2low ' set these 3 pins low
or outa,pasm_n ' set the 137 pins
or outa,maskP22 ' pin 22 high
set137_ret ret ' return
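load161pasm ' uses vmaddr
mov line_size, #1 ' rebuild the count on every call: djnz in the transfer loops leaves line_size at 0
shl line_size, offset_width ' line size in bytes
shr line_size, #1 ' divide by two: each transfer moves one 16-bit word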
or outa,maskP0P20 ' set P0-P20 high
or dira,maskP0P20 ' output pins 0-20
mov pasm_n,#0 ' group 0
call #set137 ' set the 137 output
and outa,maskP0P18low ' pins 0-18 set low
or outa,vmaddr ' output address to 161 chips
or outa,maskP19 ' clock high
or outa,maskP20 ' load high
andn outa,maskP19 ' clock low
andn outa,maskP20 ' load low
or outa,maskP19 ' clock high
or outa,maskP20 ' load high
load161pasm_ret ret
stop jmp #stop ' for debugging
memorytransfer or dira,maskP16P20 ' so /wr and other pins definitely high
or outa,maskP16P20
mov pasm_n,#1 ' back to group 1 for memory transfer
call #set137 ' as next routine will always be group 1
or dira,maskP16P20 ' output pins 16-20
or outa,maskP16P20 ' set P16-P20 high (P0-P15 set as inputs or outputs in the calling routine)
memorytransfer_ret ret
busoutput or dira,maskP0P15 ' set prop pins 0-15 as outputs
busoutput_ret ret
businput and dira,maskP16P31 ' set P0-P15 as inputs
businput_ret ret
delaynop nop
nop
nop
nop
delaynop_ret ret
'----------------------------------------------------------------------------------------------------
'
' rd_cache_line - read a cache line from external memory
'
' vmaddr is the external memory address to read
' hubaddr is the hub memory address to write
' line_size is the number of bytes to read
'
'----------------------------------------------------------------------------------------------------
rd_cache_line
' command T
pasmramtohub
call #load161pasm ' load the 161 counters with ramaddr
call #memorytransfer ' set to group 1, enable P16-P20 as outputs and set P16-P20 high
call #businput ' set prop pins P0-P15 as inputs
andn outa,maskP16 ' memory /rd low
ramtohub_loop mov data_16,ina ' get the data
wrword data_16,hubaddr ' move data to hub
andn outa,maskP19 ' clock 161 low
or outa,maskP19 ' clock 161 high
add hubaddr,#2 ' increment the hub address
djnz line_size,#ramtohub_loop
or outa,maskP16 ' memory /rd high
or dira,maskP0P15 ' %00000000_00000000_11111111_11111111 restore P0-P15 as outputs
and dira,maskP0P20P22 ' tristates all the common pins
rd_cache_line_ret
ret
'----------------------------------------------------------------------------------------------------
'
' wr_cache_line - write a cache line to external memory
'
' vmaddr is the external memory address to write
' hubaddr is the hub memory address to read
' line_size is the number of bytes to write
'
'----------------------------------------------------------------------------------------------------
wr_cache_line
' command S
pasmhubtoram
call #load161pasm ' load the 161 counters with ramaddr
call #memorytransfer ' set to group 1, enable P16-P20 as outputs and set P16-P20 high
call #busoutput ' set prop pins P0-P15 as outputs
mov t2, hubaddr ' walk a copy so hubaddr still points at this line when miss calls rd_cache_line next
hubtoram_loop and outa,maskP16P31 '%11111111_11111111_00000000_00000000 ' clear for output
rdword data_16,t2 ' get the word from hub
and data_16,maskP0P15 ' mask to a word only
or outa,data_16 ' send out the word to P0-P15
andn outa,maskP17 ' set mem write low
add t2,#2 ' increment by 2 bytes = 1 word. Put this here for small delay while writes
or outa,maskP17 ' mem write high
andn outa,maskP19 ' clock 161 low
or outa,maskP19 ' clock 161 high
djnz line_size,#hubtoram_loop ' loop this many times
and dira,maskP0P20P22 ' tristates all the common pins
wr_cache_line_ret
ret
pasm_n long 0 ' general purpose value
data_16 long 0 ' general purpose value
maskP0P2low long %11111111_11111111_11111111_11111000 ' P0-P2 low
maskP0P20 long %00000000_00011111_11111111_11111111 ' P0-P18 enabled for output plus P19,P20
maskP0P18low long %11111111_11111000_00000000_00000000 ' P0-P18 low
maskP16 long %00000000_00000001_00000000_00000000 ' pin 16
maskP17 long %00000000_00000010_00000000_00000000 ' pin 17
maskP18 long %00000000_00000100_00000000_00000000 ' pin 18
maskP19 long %00000000_00001000_00000000_00000000 ' pin 19
maskP20 long %00000000_00010000_00000000_00000000 ' pin 20
maskP22 long %00000000_01000000_00000000_00000000 ' pin 22
maskP16P31 long %11111111_11111111_00000000_00000000 ' pin 16 to pin 31
maskP0P15 long %00000000_00000000_11111111_11111111 ' for masking words
maskP16P20 long %00000000_00011111_00000000_00000000
maskP0P20P22 long %11111111_10100000_00000000_00000000 ' for returning all group pins HiZ
fit 496
I believe I have filled this out. The "overhead" and the rd and wr cache-line routines (working from the bottom of the file up) were taken from this code, which has been tested and works. It's a bit long, so I removed some portions.
DAT
'' +-----------------------------------------------------------------------------------------------+
'' | Touchblade 161 Ram Driver (with grateful acknowledgements to Cluso and Average Joe) |
'' +-----------------------------------------------------------------------------------------------+
org 0
tbp2_start ' setup the pointers to the hub command interface (saves execution time later)
' +-- These instructions are overwritten as variables after start
comptr mov comptr, par ' -| hub pointer to command
hubptr mov hubptr, par ' | hub pointer to hub address
ramptr add hubptr, #4 ' | hub pointer to ram address
lenptr mov ramptr, par ' | hub pointer to length
errptr add ramptr, #8 ' | hub pointer to error status
cmd mov lenptr, par ' | command I/R/W/G/P/Q
hubaddr add lenptr, #12 ' | hub address
ramaddr mov errptr, par ' | ram address
len add errptr, #16 ' | length
err nop ' -+ error status returned (=0=false=good)
' Initialise hardware tristates everything and read/write set the pins
init mov err, #0 ' reset err=false=good
'mov dira,zero ' tristate the pins with the cog dira
and dira,maskP0P20P22 ' tristates all the common pins
done wrlong err, errptr ' status =0=false=good, else error x
wrlong zero, comptr ' command =0 (done)
' wait for a command (pause short time to reduce power)
pause
' mov ctr, delay wz ' if =0 no pause
' if_nz add ctr, cnt
' if_nz waitcnt ctr, #0 ' wait for a short time (reduces power)
rdlong cmd, comptr wz ' command ?
if_z jmp #pause ' not yet
' decode command
cmp cmd, #"S" wz ' hub to ram
if_z jmp #pasmhubtoram
cmp cmd, #"T" wz ' ram to hub
if_z jmp #pasmramtohub
cmp cmd, #"U" wz ' ram to display
if_z jmp #pasmramtodisplay
cmp cmd, #"V" wz ' hub to display
if_z jmp #pasmhubtodisplay
cmp cmd, #"E" wz ' convert 3 byte .raw format to 2 byte .ili format - hub to hub
if_z jmp #rawtoiliformat
cmp cmd, #"F" wz ' convert 3 byte .bmp format BGR to 2 byte ili format (same as E but order reversed)
if_z jmp #bmptoiliformat
' cmp cmd, #"W" wz ' lcdwritecom in pasm, not working
' if_z jmp #pasmlcdwritecom
cmp cmd, #"X" wz ' merge icon and background based on a mask
if_z jmp #mergeicons
cmp cmd, #"Y" wz ' change the 137 output
if_z jmp #changegroup
cmp cmd, #"Z" wz ' set the 161 counters
if_z jmp #set161
cmp cmd, #"I" wz ' init
if_z jmp #init
mov err, cmd ' error = cmd (unknown command)
jmp #done
' ----------------- common routines -------------------------------------
get_values rdlong hubaddr, hubptr ' get hub address
rdlong ramaddr, ramptr ' get ram address
rdlong len, lenptr ' get length
mov err, #5 ' err=5
get_values_ret ret
' Pass pasm_n = 0- 7 come to this with P0-P20 and P22 tristated and returns them as this too
set137 or dira,maskP22 ' pin 22 is an output
andn outa,maskP22 ' set P22low so Y0-Y7 are all high
or dira,maskP0P20 ' pins P0-P20 are outputs
and outa,maskP0P2low ' set these 3 pins low
or outa,pasm_n ' set the 137 pins
or outa,maskP22 ' pin 22 high
set137_ret ret ' return
load161pasm ' uses ramaddr
or outa,maskP0P20 ' set P0-P20 high
or dira,maskP0P20 ' output pins 0-20
mov pasm_n,#0 ' group 0
call #set137 ' set the 137 output
and outa,maskP0P18low ' pins 0-18 set low
or outa,ramaddr ' output address to 161 chips
or outa,maskP19 ' clock high
or outa,maskP20 ' load high
andn outa,maskP19 ' clock low
andn outa,maskP20 ' load low
or outa,maskP19 ' clock high
or outa,maskP20 ' load high
load161pasm_ret ret
stop jmp #stop ' for debugging
memorytransfer or dira,maskP16P20 ' so /wr and other pins definitely high
or outa,maskP16P20
mov pasm_n,#1 ' back to group 1 for memory transfer
call #set137 ' as next routine will always be group 1
or dira,maskP16P20 ' output pins 16-20
or outa,maskP16P20 ' set P16-P20 high (P0-P15 set as inputs or outputs in the calling routine)
memorytransfer_ret ret
busoutput or dira,maskP0P15 ' set prop pins 0-15 as outputs
busoutput_ret ret
businput and dira,maskP16P31 ' set P0-P15 as inputs
businput_ret ret
delaynop nop
nop
nop
nop
delaynop_ret ret
' ------------------ single letter commands -------------------------------------
' command S
pasmhubtoram call #get_values ' get hubaddr,ramaddr,len
call #load161pasm ' load the 161 counters with ramaddr
call #memorytransfer ' set to group 1, enable P16-P20 as outputs and set P16-P20 high
call #busoutput ' set prop pins P0-P15 as outputs
hubtoram_loop and outa,maskP16P31 '%11111111_11111111_00000000_00000000 ' clear for output
rdword data_16,hubaddr ' get the word from hub
and data_16,maskP0P15 ' mask to a word only
or outa,data_16 ' send out the word to P0-P15
andn outa,maskP17 ' set mem write low
add hubaddr,#2 ' increment by 2 bytes = 1 word. Put this here for small delay while writes
or outa,maskP17 ' mem write high
andn outa,maskP19 ' clock 161 low
or outa,maskP19 ' clock 161 high
djnz len,#hubtoram_loop ' loop this many times
jmp #init ' tristate pins and listen for commands
' command T
pasmramtohub call #get_values ' get hubaddr,ramaddr,len
call #load161pasm ' load the 161 counters with ramaddr
call #memorytransfer ' set to group 1, enable P16-P20 as outputs and set P16-P20 high
call #businput ' set prop pins P0-P15 as inputs
andn outa,maskP16 ' memory /rd low
ramtohub_loop mov data_16,ina ' get the data
wrword data_16,hubaddr ' move data to hub
andn outa,maskP19 ' clock 161 low
or outa,maskP19 ' clock 161 high
add hubaddr,#2 ' increment the hub address
djnz len,#ramtohub_loop
or outa,maskP16 ' memory /rd high
or dira,maskP0P15 ' %00000000_00000000_11111111_11111111 restore P0-P15 as outputs
jmp #init ' tristate pins and listen for commands
' command U
pasmramtodisplay call #get_values ' get hubaddr,ramaddr,len
call #load161pasm ' load the 161 counters with ramaddr
call #memorytransfer ' set to group 1, enable P16-P20 as outputs and set P16-P20 high
call #businput ' set prop pins 0-15 as inputs so doesn't interfere with the transfer
or outa,maskP18 ' ILI_RS high
andn outa,maskP16 ' memory /rd low
ramtodisplay_loop andn outa,maskP20 ' ILI write low
or outa,maskP20 ' ILI write high
andn outa,maskP19 ' clock 161 low
or outa,maskP19 ' clock 161 high
djnz len,#ramtodisplay_loop
or outa,maskP16 ' memory /rd high
or dira,maskP0P15 ' %00000000_00000000_11111111_11111111 restore P0-P15 as outputs
jmp #init
' command V
pasmhubtodisplay call #get_values ' get hubaddr,ramaddr,len
call #memorytransfer ' set to group 1, enable P16-P20 as outputs and set P16-P20 high
call #busoutput ' set P0-P15 as outputs
hubtodisplay_loop and outa,maskP16P31 '%11111111_11111111_00000000_00000000 ' clear for output
rdword data_16,hubaddr ' get the word from hub
and data_16,maskP0P15 ' mask to a word only
or outa,data_16 ' send out the word to P0-P15
andn outa,maskP20 ' ILI write low
or outa,maskP20 ' ILI write high
add hubaddr,#2 ' one word
djnz len,#hubtodisplay_loop
jmp #init ' set pins to tristate
' variables
pasm_n long 0 ' general purpose value
data_16 long 0 ' general purpose value
ililow long 0 ' low data byte
ilihigh long 0 ' high data byte
red long 0 ' red, green blue variables
green long 0
blue long 0
' constants
Zero long %00000000_00000000_00000000_00000000 ' used in several places
maskP0P2low long %11111111_11111111_11111111_11111000 ' P0-P2 low
maskP0P20 long %00000000_00011111_11111111_11111111 ' P0-P18 enabled for output plus P19,P20
maskP0P18low long %11111111_11111000_00000000_00000000 ' P0-P18 low
maskP16 long %00000000_00000001_00000000_00000000 ' pin 16
maskP17 long %00000000_00000010_00000000_00000000 ' pin 17
maskP18 long %00000000_00000100_00000000_00000000 ' pin 18
maskP19 long %00000000_00001000_00000000_00000000 ' pin 19
maskP20 long %00000000_00010000_00000000_00000000 ' pin 20
maskP22 long %00000000_01000000_00000000_00000000 ' pin 22
maskP16P31 long %11111111_11111111_00000000_00000000 ' pin 16 to pin 31
maskP0P15 long %00000000_00000000_11111111_11111111 ' for masking words
maskP16P20 long %00000000_00011111_00000000_00000000
maskP0P20P22 long %11111111_10100000_00000000_00000000 ' for returning all group pins HiZ
fit 496
I tried turning the serial speed down, but I still can't decipher the output from "t a". It looks like gibberish:
OK, fixed that one! Now, should I take Load161 and Set137 and "stuff" them into the rd/wr routines? Do I need to rearrange the variables? Any other thoughts? You alluded to there being more than that but did not say what. Your help is MOST appreciated!
' Pass pasm_n = 0- 7 come to this with P0-P20 and P22 tristated and returns them as this too
set137 or dira,maskP22 ' pin 22 is an output
andn outa,maskP22 ' set P22low so Y0-Y7 are all high
or dira,maskP0P20 ' pins P0-P20 are outputs
and outa,maskP0P2low ' set these 3 pins low
or outa,pasm_n ' set the 137 pins
or outa,maskP22 ' pin 22 high
set137_ret ret ' return
load161pasm ' uses vmaddr
mov len, line_size ' make a copy of line_size so the original is not trashed
shr len, #1 ' divide length by two: bytes -> words
or outa,maskP0P20 ' set P0-P20 high
or dira,maskP0P20 ' output pins 0-20
mov pasm_n,#0 ' group 0
call #set137 ' set the 137 output
and outa,maskP0P18low ' pins 0-18 set low
or outa,vmaddr ' output address to 161 chips
or outa,maskP19 ' clock high
or outa,maskP20 ' load high
andn outa,maskP19 ' clock low
andn outa,maskP20 ' load low
or outa,maskP19 ' clock high
or outa,maskP20 ' load high
load161pasm_ret ret
stop jmp #stop ' for debugging
memorytransfer or dira,maskP16P20 ' so /wr and other pins definitely high
or outa,maskP16P20
mov pasm_n,#1 ' back to group 1 for memory transfer
call #set137 ' as next routine will always be group 1
or dira,maskP16P20 ' output pins 16-20
or outa,maskP16P20 ' set P16-P20 high (P0-P15 set as inputs or outputs in the calling routine)
memorytransfer_ret ret
busoutput or dira,maskP0P15 ' set prop pins 0-15 as outputs
busoutput_ret ret
businput and dira,maskP16P31 ' set P0-P15 as inputs
businput_ret ret
delaynop nop
nop
nop
nop
delaynop_ret ret
'----------------------------------------------------------------------------------------------------
'
' rd_cache_line - read a cache line from external memory
'
' vmaddr is the external memory address to read
' hubaddr is the hub memory address to write
' line_size is the number of bytes to read
'
'----------------------------------------------------------------------------------------------------
rd_cache_line
' command T
pasmramtohub
call #load161pasm ' load the 161 counters with vmaddr
call #memorytransfer ' set to group 1, enable P16-P20 as outputs and set P16-P20 high
call #businput ' set prop pins P0-P15 as inputs
andn outa,maskP16 ' memory /rd low
ramtohub_loop mov data_16,ina ' get the data
wrword data_16,hubaddr ' move data to hub
andn outa,maskP19 ' clock 161 low
or outa,maskP19 ' clock 161 high
add hubaddr,#2 ' increment the hub address
djnz len,#ramtohub_loop
or outa,maskP16 ' memory /rd high
or dira,maskP0P15 ' %00000000_00000000_11111111_11111111 restore P0-P15 as outputs
and dira,maskP0P20P22 ' tristates all the common pins
rd_cache_line_ret
ret
'----------------------------------------------------------------------------------------------------
'
' wr_cache_line - write a cache line to external memory
'
' vmaddr is the external memory address to write
' hubaddr is the hub memory address to read
' line_size is the number of bytes to write
'
'----------------------------------------------------------------------------------------------------
wr_cache_line
' command S
pasmhubtoram
call #load161pasm ' load the 161 counters with vmaddr
call #memorytransfer ' set to group 1, enable P16-P20 as outputs and set P16-P20 high
call #busoutput ' set prop pins P0-P15 as outputs
hubtoram_loop and outa,maskP16P31 '%11111111_11111111_00000000_00000000 ' clear for output
rdword data_16,hubaddr ' get the word from hub
and data_16,maskP0P15 ' mask to a word only
or outa,data_16 ' send out the word to P0-P15
andn outa,maskP17 ' set mem write low
add hubaddr,#2 ' increment by 2 bytes = 1 word. Put this here for small delay while writes
or outa,maskP17 ' mem write high
andn outa,maskP19 ' clock 161 low
or outa,maskP19 ' clock 161 high
djnz len,#hubtoram_loop ' loop this many times
and dira,maskP0P20P22 ' tristates all the common pins
wr_cache_line_ret
ret
pasm_n long 0 ' general purpose value
data_16 long 0 ' general purpose value
len long 0 ' copy of line_size divided by two (number of words to transfer)
maskP0P2low long %11111111_11111111_11111111_11111000 ' P0-P2 low
maskP0P20 long %00000000_00011111_11111111_11111111 ' P0-P18 enabled for output plus P19,P20
maskP0P18low long %11111111_11111000_00000000_00000000 ' P0-P18 low
maskP16 long %00000000_00000001_00000000_00000000 ' pin 16
maskP17 long %00000000_00000010_00000000_00000000 ' pin 17
maskP18 long %00000000_00000100_00000000_00000000 ' pin 18
maskP19 long %00000000_00001000_00000000_00000000 ' pin 19
maskP20 long %00000000_00010000_00000000_00000000 ' pin 20
maskP22 long %00000000_01000000_00000000_00000000 ' pin 22
maskP16P31 long %11111111_11111111_00000000_00000000 ' pin 16 to pin 31
maskP0P15 long %00000000_00000000_11111111_11111111 ' for masking words
maskP16P20 long %00000000_00011111_00000000_00000000
maskP0P20P22 long %11111111_10100000_00000000_00000000 ' for returning all group pins HiZ
fit 496
*edited*
Now I'm getting this:
CACHE TEST> t a
Test 0
Address Walking 0's 15 address bits.
00007ff8 00000000
00007ff4 00000000
00007fec 00000000
00007fdc 00000000
00007fbc 00000000
00007f7c 00000000
00007efc 00000000
00007dfc 00000000
00007bfc 00000000
000077fc 00000000
00006ffc 00000000
00005ffc 00002000
ERROR! Expected 0 @ 00007ffc after write to address 00005ffc
00002000
Test 3
Random Addr Pattern Test 32 KB
----------------------------------------------------------------
cccccccccccccccccccccccccccccccc
w
r
Test Complete!
CACHE TEST> t 4
Test 4
Pseudo-Random Pattern Test -1656445 KB
----------------------------------------------------------------
w
r
ERROR at $00000000 Expected $174b43e3 Received $cd32dcc6
Address $00000000 0K Page
re #1. I've looked all over and I can't find it. I'll do some testing in the Spin driver shortly, but the circuit appears correct: no open connection, and no short to ANYTHING! I tested both boards, with the same results.
Well at least you verified that. So, for sure it's a driver problem.
re #2. It looks like this is the first candidate for fixing. Any suggestions how to approach a fix?
The length division is necessary because you are writing/reading 2 bytes at a time.
Actually, it's very curious that the data looks like it's decrementing.
Need data. Do the following:
1. Fill the cache line with incremental pattern.
CACHE TEST> i 2000
Incremental Pattern Test 8 KB
----------------------------------------------------------------
wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww
rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
Test Complete!
' check address 0 for 00000001
CACHE TEST> r 0
R 00000000 00000001
' flush address 0 by reading 2000
CACHE TEST> r 0
' check address 0 again - what value do you see?
CACHE TEST> r 0
Hmm, it seems as if the cache is not flushing on the second "r":
CACHE TEST> i 2000
Incremental Pattern Test 8 KB
----------------------------------------------------------------
wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww
rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
Test Complete!
CACHE TEST> r 0
R 00000000 00000001
CACHE TEST> r 0
R 00000000 00000001
CACHE TEST> r 0
R 00000000 00000001
*edited*
I have done my best to verify proper operation. I'm currently stepping through addresses looking for anomalies.
Hi David, sorry about that. I'll post the entire driver again.
{
Skeleton JCACHE external RAM driver
Copyright (c) 2011 by David Betz
Based on code by Steve Denson (jazzed)
Copyright (c) 2010 by John Steven Denson
Inspired by VMCOG - virtual memory server for the Propeller
Copyright (c) February 3, 2010 by William Henning
For the EuroTouch 161 By James Moxham and Joe Heinz
TERMS OF USE: MIT License
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
}
CON
' default cache dimensions
DEFAULT_INDEX_WIDTH = 6
DEFAULT_OFFSET_WIDTH = 4
' cache line tag flags
EMPTY_BIT = 30
DIRTY_BIT = 31
PUB image
return @init_vm
DAT
org $0
' initialization structure offsets
' $0: pointer to a two word mailbox
' $4: pointer to where to store the cache lines in hub ram
' $8: number of bits in the cache line index if non-zero (default is DEFAULT_INDEX_WIDTH)
' $a: number of bits in the cache line offset if non-zero (default is DEFAULT_OFFSET_WIDTH)
' note that $4 must be at least 2^(index_width+offset_width) bytes in size
' the cache line mask is returned in $0
init_vm mov t1, par ' get the address of the initialization structure
rdlong pvmcmd, t1 ' pvmcmd is a pointer to the virtual address and read/write bit
mov pvmaddr, pvmcmd ' pvmaddr is a pointer into the cache line on return
add pvmaddr, #4
add t1, #4
rdlong cacheptr, t1 ' cacheptr is the base address in hub ram of the cache
add t1, #4
rdlong t2, t1 wz
if_nz mov index_width, t2 ' override the index_width default value
add t1, #4
rdlong t2, t1 wz
if_nz mov offset_width, t2 ' override the offset_width default value
mov index_count, #1
shl index_count, index_width
mov index_mask, index_count
sub index_mask, #1
mov line_size, #1
shl line_size, offset_width
mov t1, line_size
sub t1, #1
wrlong t1, par
' put external memory initialization here
jmp #vmflush
fillme long 0[128-fillme] ' first 128 cog locations are used for a direct mapped cache table
fit 128
' initialize the cache lines
vmflush movd :flush, #0
mov t1, index_count
:flush mov 0-0, empty_mask
add :flush, dstinc
djnz t1, #:flush
' start the command loop
waitcmd wrlong zero, pvmcmd
:wait rdlong vmline, pvmcmd wz
if_z jmp #:wait
shr vmline, offset_width wc ' carry is now one for read and zero for write
mov set_dirty_bit, #0 ' make mask to set dirty bit on writes
muxnc set_dirty_bit, dirty_mask
mov line, vmline ' get the cache line index
and line, index_mask
mov hubaddr, line
shl hubaddr, offset_width
add hubaddr, cacheptr ' get the address of the cache line
wrlong hubaddr, pvmaddr ' return the address of the cache line
movs :ld, line
movd :st, line
:ld mov vmcurrent, 0-0 ' get the cache line tag
and vmcurrent, tag_mask
cmp vmcurrent, vmline wz ' z set means there was a cache hit
if_nz call #miss ' handle a cache miss
:st or 0-0, set_dirty_bit ' set the dirty bit on writes
jmp #waitcmd ' wait for a new command
' line is the cache line index
' vmcurrent is current cache line
' vmline is new cache line
' hubaddr is the address of the cache line
miss movd :test, line
movd :st, line
:test test 0-0, dirty_mask wz
if_z jmp #:rd ' current cache line is clean, just read new one
mov vmaddr, vmcurrent
shl vmaddr, offset_width
call #wr_cache_line ' write current cache line
:rd mov vmaddr, vmline
shl vmaddr, offset_width
call #rd_cache_line ' read new cache line
:st mov 0-0, vmline
miss_ret ret
' pointers to mailbox entries
pvmcmd long 0 ' on call this is the virtual address and read/write bit
pvmaddr long 0 ' on return this is the address of the cache line containing the virtual address
cacheptr long 0 ' address in hub ram where cache lines are stored
vmline long 0 ' cache line containing the virtual address
vmcurrent long 0 ' current selected cache line (same as vmline on a cache hit)
line long 0 ' current cache line index
set_dirty_bit long 0 ' DIRTY_BIT set on writes, clear on reads
zero long 0 ' zero constant
dstinc long 1<<9 ' increment for the destination field of an instruction
t1 long 0 ' temporary variable
t2 long 0 ' temporary variable
tag_mask long !(1<<DIRTY_BIT) ' includes EMPTY_BIT
index_width long DEFAULT_INDEX_WIDTH
index_mask long 0
index_count long 0
offset_width long DEFAULT_OFFSET_WIDTH
line_size long 0 ' line size in bytes
empty_mask long (1<<EMPTY_BIT)
dirty_mask long (1<<DIRTY_BIT)
' input parameters to rd_cache_line and wr_cache_line
vmaddr long 0 ' external address
hubaddr long 0 ' hub memory address
'get_values rdlong hubaddr, hubptr ' get hub address
' rdlong ramaddr, ramptr ' get ram address
' rdlong len, lenptr ' get length
' mov err, #5 ' err=5
'get_values_ret ret
'init 'mov err, #0 ' reset err=false=good
'mov dira,zero ' tristate the pins with the cog dira
' and dira,maskP0P20P22 ' tristates all the common pins
'done 'wrlong err, errptr ' status =0=false=good, else error x
'wrlong zero, comptr ' command =0 (done)
' Pass pasm_n = 0- 7 come to this with P0-P20 and P22 tristated and returns them as this too
set137 or dira,maskP22 ' pin 22 is an output
andn outa,maskP22 ' set P22low so Y0-Y7 are all high
or dira,maskP0P20 ' pins P0-P20 are outputs
and outa,maskP0P2low ' set these 3 pins low
or outa,pasm_n ' set the 137 pins
or outa,maskP22 ' pin 22 high
set137_ret ret ' return
load161pasm ' uses vmaddr
mov len, line_size ' make a copy of line_size so the original is not trashed
shr len, #1 ' divide length by two: bytes -> words
or outa,maskP0P20 ' set P0-P20 high
or dira,maskP0P20 ' output pins 0-20
mov pasm_n,#0 ' group 0
call #set137 ' set the 137 output
and outa,maskP0P18low ' pins 0-18 set low
or outa,vmaddr ' output address to 161 chips
'or outa,maskP19 ' clock high
'or outa,maskP20 ' load high
'andn outa,maskP19 ' clock low
'andn outa,maskP20 ' load low
'or outa,maskP19 ' clock high
'or outa,maskP20 ' load high
andn outa,maskP20 ' load low
andn outa,maskP19 ' clock low
or outa,maskP19 ' clock high
or outa,maskP20 ' load high
load161pasm_ret ret
stop jmp #stop ' for debugging
memorytransfer or dira,maskP16P20 ' so /wr and other pins definitely high
or outa,maskP16P20
mov pasm_n,#1 ' back to group 1 for memory transfer
call #set137 ' as next routine will always be group 1
or dira,maskP16P20 ' output pins 16-20
or outa,maskP16P20 ' set P16-P20 high (P0-P15 set as inputs or outputs in the calling routine)
memorytransfer_ret ret
busoutput or dira,maskP0P15 ' set prop pins 0-15 as outputs
busoutput_ret ret
businput and dira,maskP16P31 ' set P0-P15 as inputs
businput_ret ret
delaynop nop
nop
nop
nop
delaynop_ret ret
'----------------------------------------------------------------------------------------------------
'
' rd_cache_line - read a cache line from external memory
'
' vmaddr is the external memory address to read
' hubaddr is the hub memory address to write
' line_size is the number of bytes to read
'
'----------------------------------------------------------------------------------------------------
rd_cache_line
' command T
pasmramtohub
call #load161pasm ' load the 161 counters with vmaddr
call #memorytransfer ' set to group 1, enable P16-P20 as outputs and set P16-P20 high
call #businput ' set prop pins P0-P15 as inputs
andn outa,maskP16 ' memory /rd low
ramtohub_loop mov data_16,ina ' get the data
wrword data_16,hubaddr ' move data to hub
andn outa,maskP19 ' clock 161 low
or outa,maskP19 ' clock 161 high
add hubaddr,#2 ' increment the hub address
djnz len,#ramtohub_loop
or outa,maskP16 ' memory /rd high
or dira,maskP0P15 ' %00000000_00000000_11111111_11111111 restore P0-P15 as outputs
and dira,maskP0P20P22 ' tristates all the common pins
rd_cache_line_ret
ret
'----------------------------------------------------------------------------------------------------
'
' wr_cache_line - write a cache line to external memory
'
' vmaddr is the external memory address to write
' hubaddr is the hub memory address to read
' line_size is the number of bytes to write
'
'----------------------------------------------------------------------------------------------------
wr_cache_line
' command S
pasmhubtoram
call #load161pasm ' load the 161 counters with vmaddr
call #memorytransfer ' set to group 1, enable P16-P20 as outputs and set P16-P20 high
call #busoutput ' set prop pins P0-P15 as outputs
hubtoram_loop and outa,maskP16P31 '%11111111_11111111_00000000_00000000 ' clear for output
rdword data_16,hubaddr ' get the word from hub
and data_16,maskP0P15 ' mask to a word only
or outa,data_16 ' send out the word to P0-P15
andn outa,maskP17 ' set mem write low
add hubaddr,#2 ' increment by 2 bytes = 1 word. Put this here for small delay while writes
or outa,maskP17 ' mem write high
andn outa,maskP19 ' clock 161 low
or outa,maskP19 ' clock 161 high
djnz len,#hubtoram_loop ' loop this many times
and dira,maskP0P20P22 ' tristates all the common pins
wr_cache_line_ret
ret
pasm_n long 0 ' general purpose value
data_16 long 0 ' general purpose value
len long 0 ' copy of line_size divided by two (number of words to transfer)
maskP0P2low long %11111111_11111111_11111111_11111000 ' P0-P2 low
maskP0P20 long %00000000_00011111_11111111_11111111 ' P0-P18 enabled for output plus P19,P20
maskP0P18low long %11111111_11111000_00000000_00000000 ' P0-P18 low
maskP16 long %00000000_00000001_00000000_00000000 ' pin 16
maskP17 long %00000000_00000010_00000000_00000000 ' pin 17
maskP18 long %00000000_00000100_00000000_00000000 ' pin 18
maskP19 long %00000000_00001000_00000000_00000000 ' pin 19
maskP20 long %00000000_00010000_00000000_00000000 ' pin 20
maskP22 long %00000000_01000000_00000000_00000000 ' pin 22
maskP16P31 long %11111111_11111111_00000000_00000000 ' pin 16 to pin 31
maskP0P15 long %00000000_00000000_11111111_11111111 ' for masking words
maskP16P20 long %00000000_00011111_00000000_00000000
maskP0P20P22 long %11111111_10100000_00000000_00000000 ' for returning all group pins HiZ
fit 496
I just sent a test board to Steve; it should be there on Sunday. The notes on the board for Steve's reference are as follows:
# 1. Does not match *optimized clock logic*: wired for a 74HC00 instead of a 74HC08. I can walk Steve through the changes if necessary, but the logic DOES WORK.
*edit* drawn wrong, correct now
# 2. Prop Plug pin 1 is pin 14 of the max chip (4-pin header installed).
# 3. IC8 (just under the LEDs) is the 74HC00; sorry, it's not loaded.
# 4. The 4-pin header at bottom left is +5, +3.3 and GND.
# 5. I think I forgot to load an EEPROM in the socket before shipping, sorry.
That should be it. If you have any questions please let me know.
Thanks again for all your time and help guys!
# 1. Does not match *optimized load logic*: wired for a 74HC00 instead of a 74HC08. I can walk Steve through the changes if necessary, but the logic DOES WORK.
Joe, Is P19 connected directly to the counter clock inputs?
Steve, the schematic was drawn wrong. Load is supposed to be clock. P20 is supposed to be P19. I will fix and update in a second, sorry about that.
P20 is the load, or-d with Group 0.
*edit*
I will work up a description for you. Schematic is posted on 109 but I will edit this post to contain text description.
P0-15 ----> Data bus
P0-18 ----> Address bus to 161s
P0-2 ----> 137 Address
P22 ----> 137 Latch/Enable
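As a cross-check of the pin map above, the mask constants used in the driver can be derived from the pin ranges. This is only a host-side sketch in Python, not part of the driver; the helper name `pin_mask` and the Python constant names are made up for illustration, and the values they are compared against are the `maskP...` longs from the posted PASM.

```python
def pin_mask(lo, hi):
    """Return a 32-bit mask with bits lo..hi (inclusive) set. Hypothetical helper."""
    return ((1 << (hi - lo + 1)) - 1) << lo

MASK_P0P15 = pin_mask(0, 15)    # data bus
MASK_P0P18 = pin_mask(0, 18)    # address bus to the 161 counters
MASK_P0P2 = pin_mask(0, 2)      # 137 address
MASK_P22 = pin_mask(22, 22)     # 137 latch/enable
MASK_P0P20 = pin_mask(0, 20)    # bus plus the P16-P20 control pins

# AND-ing DIRA with this value clears P0-P20 and P22, which tristates the group pins
MASK_HIZ = ~(MASK_P0P20 | MASK_P22) & 0xFFFFFFFF

print(f"{MASK_P0P15:032b}")
print(f"{MASK_HIZ:032b}")
```

If the pin assignment on a board differs from the list above, the same helper gives a quick way to regenerate the constants rather than editing binary literals by hand.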
*edit*
So far, it's mostly a "copy-paste-modify" job. It should be getting close though.
This is the overhead for the driver:
And the actual reads:
The writes:
Last is the variable overhead:
I assume I don't need the "get_value" since that's already taken care of? Still need to work out the byte-word alignment problem but...
Result of the BSTC are
Any thoughts on how to fix the byte-word alignment issue? Do 2 transfers per loop and shift "line_size" right by 1 at the start?
*edit* ok RTFM and I think I got it... I'll be back to you with the results:
So if you take the standard XMM code for the Board of Education and just change the SD pin numbers, you can run huge C programs on our board.
That simplifies things because then all the SRAM accesses are done from C code. And all the PASM part does not need to change. So all you really have to do is rewrite the PUB docmd() routine to run from C, as well as change the cog start and stop methods to the C syntax.
GCC never needs to know about the SRAM and that means that GCC does not need changing.
A few weeks back I thought a lot about using the SRAM as the cache for GCC but I really don't think we need it any more. And that frees up the entire SRAM for bitmaps and fonts and large text buffers - all of which will be handled by a C program (which will probably ultimately become a .h program). There is a Spin to C translator out there, so it may even be possible to feed routines into this and porting over the spin code we have should be even easier. The pasm part stays as it is, and the spin gets converted to C.
Have you had a chance to test the code I posted on the GUI thread and the Files thread?
http://forums.parallax.com/showthread.php?140010-Files&p=1099650&viewfull=1#post1099650
Yes it would be interesting to do the experiment. I was basing my comments on experiments we did last year with Catalina where we found in the end that caching was so efficient that the actual speed of transfers to and from the cache hardly seemed to affect the overall performance.
But, hypothetically, if one did use the sram, there are two simple routines to transfer data to and from hub. Pass to a pasm routine a single letter command to send or receive, a long for hub address, a long for ram address and number of bytes.
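A rough picture of that command block, sketched in Python on the host side. The layout (command, hub address, RAM address, length and error status as five consecutive longs at offsets 0, 4, 8, 12 and 16) is taken from the pointer setup in the Touchblade driver posted earlier; the helper name `build_mailbox` and the example addresses are made up. Note that, as far as I can tell from the posted loops, each iteration moves one 16-bit word, so the length there counts words and a byte count would need halving first.

```python
import struct

# command, hub address, ram address, length, error: five little-endian longs
MAILBOX_FMT = "<5I"

def build_mailbox(cmd, hub_addr, ram_addr, length):
    """Pack a command block the way the driver's pointers expect it. Hypothetical helper."""
    return struct.pack(MAILBOX_FMT, ord(cmd), hub_addr, ram_addr, length, 0)

# "T" is the ram-to-hub command in the posted driver; the addresses are example values
block = build_mailbox("T", 0x4000, 0x100, 8)
cmd, hub_addr, ram_addr, length, err = struct.unpack(MAILBOX_FMT, block)
print(chr(cmd), hex(hub_addr), hex(ram_addr), length, err)
```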
And I guess you need a flag somewhere in hub to say the code is using the sram or the cache is using the sram.
I don't think it will be needed though. XMMC from hub is fast, and it could be as fast as LMM when functions are small enough to fit in hub. Whenever something runs a bit slow, I find it is easier to port things over to pasm.
Is there a generic big C program we can use to test XMMC? A text based adventure program or something?
With the right device, XMMC is faster than SPIN; with the wrong device (SDcard) it is slower than SPIN. XMMC on EEPROM is faster than XMMC on SDcard.
XMM-SINGLE lets you use all of SRAM for code and data; it is usually slower than XMMC.
One of the bigger programs I've used for testing is David's EBASIC2. It works great and can quickly highlight performance differences. EBASIC2 is in the Propeller-GCC demos package.
Here's what I have so far. Please advise the best place to put the "overhead" such as Load161. I'm also not sure of what tests to run in test_cache once the cache file is built. Thanks as always!
*edit*
I tried "T a" and get a bunch of nonsense from the terminal?
Can you post the full .spin file? Do you have a pointer to the schematic?
Here's the fully filled out skeleton cache *HOPEFULLY*
I believe I have filled this out. The "overhead" and the rd and wr cache line routines *bottom of file up* were taken from this code, which has been tested good. It's a bit long, so I removed some portions. I tried turning the serial speed down, but I still can't decipher the output from "t a". It looks like gibberish:
The biggest problem is that you're trashing the line_size variable. You need to make a copy of it to your own length variable. I.e.
*edited*
Now I'm getting this: Test 3 will always succeed??
There are many optimizations possible in your code. Let's get it working right first.
Other points:
- Looks like you have an aliasing problem with address bit A13. There may be an open connection from the 74161 to SRAM.
- The incremental test should have data always increasing from 0. Your data is swapped on even/odd addresses.
- The one test that is passing all the time does that because the test range is not big enough.
- Not sure what's happening with the last test. I've never seen a negative test range before with test_cache.spin.
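For the A13 aliasing point, a common way to confirm it is a walking address-line check: write one pattern at address 0, a different pattern at each single-bit address, and see whether address 0 got overwritten. The sketch below is only an illustration of that idea in C. The function name, the `rd`/`wr` callbacks, the 15-bit simulated word-addressed memory and `sim_dead_mask` are all assumptions, not part of test_cache.spin.

```c
#include <stdint.h>

typedef uint16_t (*rd_fn)(uint32_t addr);
typedef void (*wr_fn)(uint32_t addr, uint16_t val);

/* Returns a bitmask of address lines that alias onto address 0.
   0 means no aliasing was detected on the tested lines. */
uint32_t find_aliased_lines(rd_fn rd, wr_fn wr, unsigned addr_bits)
{
    uint32_t bad = 0;
    for (unsigned bit = 0; bit < addr_bits; bit++) {
        wr(0, 0xAAAA);            /* reference pattern at address 0 */
        wr(1u << bit, 0x5555);    /* different pattern with one line set */
        if (rd(0) != 0xAAAA)      /* address 0 changed: that line is not decoded */
            bad |= 1u << bit;
    }
    return bad;
}

/* Simulated SRAM (assumption): any address bit set in sim_dead_mask is
   forced to 0, which is how an open address line would behave if it
   happened to read low. */
static uint16_t sim_mem[1u << 15];
static uint32_t sim_dead_mask;

static uint16_t sim_rd(uint32_t a)
{
    return sim_mem[a & ~sim_dead_mask & 0x7FFFu];
}

static void sim_wr(uint32_t a, uint16_t v)
{
    sim_mem[a & ~sim_dead_mask & 0x7FFFu] = v;
}
```

On real hardware the `rd`/`wr` callbacks would go through the driver, and a truly floating line may not read consistently, so treat a clean result with some caution.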
Other notes:
Well at least you verified that. So, for sure it's a driver problem.
The length division is necessary because you are writing/reading 2 bytes at a time.
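As a rough illustration of both points (keep `line_size` intact, and count words rather than bytes), here is a C sketch of the two line transfers. It is not the actual driver: the real code is PASM, and the function names, the global `line_size`, and the low-byte-at-even-address ordering are assumptions made for the example.

```c
#include <stdint.h>
#include <stddef.h>

/* Set once at init; assumed to be an even number of bytes.
   The transfer loops must never modify it. */
static size_t line_size = 64;

/* SRAM -> hub: one 16-bit word becomes two hub bytes. */
void read_line(uint8_t *hub, const uint16_t *sram, uint32_t word_addr)
{
    size_t words = line_size / 2;          /* private loop counter */
    while (words--) {
        uint16_t w = sram[word_addr++];
        *hub++ = (uint8_t)(w & 0xFF);      /* low byte at the even address (assumption) */
        *hub++ = (uint8_t)(w >> 8);        /* high byte at the odd address */
    }
}

/* hub -> SRAM: two hub bytes become one 16-bit word. */
void write_line(uint16_t *sram, uint32_t word_addr, const uint8_t *hub)
{
    size_t words = line_size / 2;          /* private loop counter */
    while (words--) {
        uint16_t w = (uint16_t)(hub[0] | (hub[1] << 8));
        hub += 2;
        sram[word_addr++] = w;
    }
}
```

If the two bytes were packed the other way round, incrementing test data would show up swapped on even/odd addresses, so the byte order is worth checking against what the test prints.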
Actually, it's very curious that the data looks like it's decrementing.
Need data. Do the following:
I have to run some errands. Be back in a while.
--Steve
I'm sorry, I meant
CACHE TEST> r 2000
then
CACHE TEST> r 0
Could you post your entire driver .spin file? It's kind of hard to see what's happening with the fragments embedded in your messages.
Thanks,
David
I just sent a test board to Steve, it should be there on Sunday. The notes on the board for Steve's reference are as follows :
# 1. Does not match *optimized clock logic*: wired for a 74HC00 instead of a 74HC08. I can walk Steve through the changes if necessary but the logic DOES WORK.
*edit* drawn wrong, correct now
# 2. Prop Plug pin 1 is pin 14 of max chip *4 pin header installed*.
# 3. IC8 *just under LEDs* is the 74HC00, sorry it's not loaded.
# 4. The 4 Pin header, bottom left is +5, +3.3 and GND.
# 5. I think I forgot to load an EEPROM in the socket before shipping, sorry.
That should be it. If you have any questions please let me know.
Thanks again for all your time and help guys!
Joe, Is P19 connected directly to the counter clock inputs?
Thanks,
--Steve
P20 is the load, ORd with Group 0.
*edit*
I will work up a description for you. Schematic is posted on 109 but I will edit this post to contain text description.
P0-15 ----> Data bus
P0-18 ----> Address bus to 161s
P0-2 ----> 137 Address
P22 ----> 137 Latch/Enable
P16 ----> SRAM RD
P17 ----> SRAM WR
Group1 ----> SRAM CS
P20 ORd Group0 ----> Load 161s
Group 0 AND Group 1 ---> ORd Pin 19 ----> Clock 161
That should cover the basics.
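For anyone writing driver code against this, the pin list above could be written down as bit masks along these lines. This is just a sketch in C with made-up macro names; only the pin numbers come from the list. The Group0/Group1 selects are outputs of the 137 rather than Prop pins, so they are left out.

```c
#include <stdint.h>

/* Helper macros (names are illustrative). */
#define PIN(n)        (1u << (n))
#define PINS(lo, hi)  ((uint32_t)(((1ull << ((hi) - (lo) + 1)) - 1ull) << (lo)))

#define DATA_MASK     PINS(0, 15)   /* P0-15: data bus            */
#define ADDR_MASK     PINS(0, 18)   /* P0-18: address bus to 161s */
#define SEL137_MASK   PINS(0, 2)    /* P0-2:  137 address         */
#define SRAM_RD       PIN(16)       /* P16:   SRAM RD             */
#define SRAM_WR       PIN(17)       /* P17:   SRAM WR             */
#define CLK161        PIN(19)       /* P19:   clock to 161s (gated by the group selects) */
#define LOAD161       PIN(20)       /* P20:   load 161s (ORd with Group0) */
#define LATCH137      PIN(22)       /* P22:   137 latch/enable    */

/* Pins that two functions have in common, e.g. the multiplexed bus. */
static inline uint32_t shared_pins(uint32_t a, uint32_t b)
{
    return a & b;
}
```

Note that `shared_pins(ADDR_MASK, SRAM_RD)` is non-zero: going by the list, P16 and P17 carry address bits while the 161s are being loaded and act as RD/WR afterwards, so the driver has to sequence them carefully.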
Thanks a million guys!