
Propeller Caching XMM Longs for LMM

jazzed Posts: 11,803
edited 2009-12-21 06:08 in Propeller 1
What I'm hoping to discover is the best practice for caching XMM longs for Propeller LMM execution.

This is not a new idea of course. Caching is used everywhere. Bill Henning appears to be doing some caching in some Morpheus code. The ICC kernel has data caching. I've no idea what Catalina does. Spin does not cache of course :)

What is the best practice for Propeller instruction caching?

In the rest of this post, I try to think through the question out loud. Help me make sense of it.
Data cache should be considered too? ... probably.

In the simplest terms, a cache is a small, fast store of data fronting a larger, slower one. There are different cache levels commonly used ... on-chip primary or level 1, secondary or level 2 (on or off chip), tertiary or level 3 (virtual memory for swap, etc.), and others dependent upon the architect's whim.

When thinking about Propeller cache, one could specify one or more of the following:
    1. Intra-COG Cache

    How would an Intra-COG cache work? ... it would have to be small and simple, and would work best starting at register 0. Tags would complicate things too much.
    What advantages would it have? ... not paying the hub-access penalty on every instruction.
    Is it possible to read enough longs from HUB fast enough for an LMM Intra-COG cache to be useful? (see the fill-loop sketch after this list)
    Is it possible to use XMM directly? ... possible, but not as fast?

    2. HUB Cache

    How would a HUB Cache work? ... could maintain a larger cache. Tags are possible?
    Does it make sense for one or more COGs to load and maintain a write-back cache with XMM?
    With multiple cogs, read rates > 10MB/s are possible but take work.

    3. XMM Virtual Memory Cache

    Can one manage Virtual Memory in XMM from a HUB Cache manager or other method?
    Would this most likely be necessary for larger programs?

    4. SD card, other serial, or even tiny HDD Virtual Swap.

    When exactly does it become useful or imperative to start using a swap partition on "fixed" media like SD card?
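
    As a concrete reference for item 1, here is a minimal FCACHE-style fill loop (my own untested sketch; all names are mine). It burst-copies a small block of hub longs into COG registers, then executes them at native speed:
    fcache_load   movd      :fill,  #cache0         ' point rdlong's destination at cache0
                  mov       count,  #16             ' block size in longs
    :fill         rdlong    0-0,    PC              ' fetch the next long from hub
                  add       PC,     #4              ' advance the program counter
                  add       :fill,  dstinc          ' bump rdlong's destination field
                  djnz      count,  #:fill          ' ~32 cycles per long in this simple form
                  jmp       #cache0                 ' run the block at full COG speed
                                                    ' (the block must end by jumping back)
    dstinc        long      1 << 9                  ' +1 in an instruction's destination field
    count         long      0
    cache0        res       16                      ' the cached block lives here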

    Time:

    In any event, we have certain constraints with the Propeller. While it is great to employ COGs for our purposes, the simple design forces us to pay attention to time when making choices. As PASM coders know, each COG gets a hub-access window every 16 system clocks; miss the window and you wait for the next one.

    Consider the simple Propeller LMM loop (skip if you know this). It takes 32 cycles per instruction because the 16-cycle hub window is missed. To mitigate this, the LMM loop is "unrolled" N times to increase performance, so that N-1 instructions cost 16 cycles each and only the one following the jmp costs 32.
    nexti         rdlong inst, PC   ' get instruction
                  add    PC, #4     ' increment PC by 4 to the next long
    inst          nop               ' fetched instruction executes here
                  jmp    #nexti     ' 32 cycles per instruction
    
    


    nexti         rdlong inst1, PC
                  add PC, #4
    inst1         nop
                  rdlong inst2, PC
                  add PC, #4
    inst2         nop
                  jmp #nexti  ' 16+32 cycles if no LMM jumps are executed              
    
    


    So how would a cache work? Clearly, regardless of the approach, more instructions per loop are required.
    Let's say that you have an LMM Kernel COG and one or more XMMU COG(s). The LMM loop might become something like this:
    nexti         wrlong    PC, CachePtr ' assumes an address is never 0
                  add       PC, #4
                  rdlong    cinst, CdataPtr wz
            if_z  jmp       #$-1 ' wait for an instruction ... NOPs are padded non-zero
    cinst         nop
                  jmp       #nexti  ' 48 or more cycles depending on cache availability
    
    
    


    Is there a better way? That handshake forces us to waste cycles. At this point, with no optimizations at all and assuming a very efficient cache hit/miss COG, the instruction rate is at most 1.67 MIPS at 80MHz. This is not bad considering the maximum aggregate unrolled LMM rate is about 2.6 MIPS.
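
    For concreteness, the cycle arithmetic behind those figures (my back-of-envelope numbers at 80MHz, 4 clocks per instruction, hub window every 16 clocks):
    ' 48-cycle fetch/execute loop:  80_000_000 / 48 = ~1.67 MIPS
    ' 32-cycle simple LMM loop:     80_000_000 / 32 =  2.5 MIPS
    ' unrolling lands more fetches in the 16-cycle hub window,
    ' which is where the higher aggregate LMM estimate comes from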

    Ok, so what does the Cache COG look like? Cache COG sounds like where the money is, and from what I've seen it adds up (or shrinks).
    Let's say you have a Cache COG loop like this:
    dat ' CACHE COG
    :nextread
                  rdlong    cadr,   AddrPtr wz      ' got milk?
            if_z  jmp       #$-1                    ' repeat if nothing
                  cmp       cadr,   lcache wc,wz    ' C set if below the cached range
            if_ae cmp       cadr,   hcache wc,wz    ' only if address >= cache low addr
            if_ae call      #cacheread              ' miss if address >= cache high (below-range miss unhandled in this sketch)
                  and       cadr,   CMASK           ' map the address to a cache register
                  movd      :rdch,  cadr            ' patch the write instruction below
                  nop                               ' one-instruction gap for the patch
    :rdch         wrlong    0-0,    DataPtr
                  jmp       #:nextread              ' 64 cycles minimum
    
    cacheread     ' not defined yet
    cacheread_ret ret
                  
    lcache        long 0
    hcache        long 0
    cadr          long 0
    AddrPtr       long 0
    DataPtr       long 0
    CMASK         long $3f
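
    One possible shape for the undefined cacheread above (an untested sketch; the XMM burst read itself is hardware specific and left abstract, and linesize is my own name):
    cacheread     mov       lcache, cadr            ' new window starts at the missed address
                  mov       hcache, lcache
                  add       hcache, linesize        ' window end = start + line span
                  ' ... hardware-specific burst read of the new line from XMM
                  '     into the cache registers goes here ...
    cacheread_ret ret

    linesize      long      64                      ' line span in bytes; must agree with CMASK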
    
    


    With 64 cycles in the Cache COG, we get maximum 1.25 MIPS. Can we do better than that?
    One way to do better would be to optimize the loop above, but I don't see it ... maybe someone else does.

    Another way to do better seems to be to have an OnCOG cache. The OnCOG cache would look much like the Cache COG, except that the only HUB access needed would be on a cache miss. So, a routine might look like this:
    :nextread
                  mov       cadr,   PC
                  cmp       cadr,   lcache wc,wz    ' C set if below the cached range
            if_ae cmp       cadr,   hcache wc,wz    ' only if address >= cache low addr
            if_ae call      #cacheread              ' miss if address >= cache high
                  and       cadr,   CMASK           ' map the address to a cache register
                  movs      :rdch,  cadr            ' patch the read instruction below
                  nop                               ' make this  add PC, #4 ? for a loop?
    :rdch         mov       inst,   0-0             ' source field patched by movs above
                  jmp       #:nextread              ' 36 cycles minimum
    
    cacheread     ' not defined yet ... all hub operations go here                
    cacheread_ret ret
    
    


    Can we do better?

    The other types of cache could be discussed here later.
    Code fragments are not guaranteed to work.

    This is all I have time to write now. I would appreciate thoughts on the subject and optimizations if any.

    Thanks,
    --Steve

    Comments

    • RossH Posts: 5,519
      edited 2009-12-09 22:44
      Hi jazzed,

      For the record - Catalina doesn't do any caching ... yet.

      I've given it some thought, but unfortunately, I haven't had time to do much about it. But here is my current view:

      Although I appreciate the cleverness of it, I decided against the original LMM FCACHE approach (I think this equates to what you call an 'intra-Cog' cache) - it works beautifully on text-book examples, but in real-world programs it can be detrimental - this is because the kernel space it consumes can often be better used implementing other things that also speed up program execution. I've benchmarked this on actual C programs, and while there are indeed FCACHED programs that do execute much faster, they are often fairly simple cases - real-world programs often execute slower when FCACHED.

      My current thinking is that a "hub cache" is the way to go. This would have another big advantage for Catalina, in that one XMM hub cache could service multiple kernels without much additional effort. One of my original design goals for Catalina (not yet achieved) was to support multiple kernels running simultaneously on different cogs - but once I started supporting XMM this became more difficult to achieve, because in most XMM implementations only one cog can access the XMM memory effectively - the overheads of supporting multiple cogs accessing XMM would make it too slow to be any use at all.

      Interesting topic - thanks for starting it. I look forward to hearing what others think.

      Ross.





      If someone else does, I'll happily adopt.

      Ross.

    • Bill Henning Posts: 6,445
      edited 2009-12-09 22:50
      Hi Steve,

      I have been giving this much thought, and will be trying different approaches - as soon as I have some time :)

      The first pre-release of PropellerBasic will not do any sort of LMM caching, but I am planning to add full FCACHE support.

      Later, when I add XLMM support, I will look into how to best speed up XLMM - I posted an idea in the other thread that should work, but it would chew up most of the cogs.

      Now PropII... I am drooling at what will be possible with it, however I expect it will be quite a while before it is on the market.

      Best Regards,

      Bill
      jazzed said...
      What I'm hoping to discover is the best practice for caching XMM longs for Propeller LMM execution. ...
      • kuroneko Posts: 3,623
        edited 2009-12-10 06:03
        jazzed said...
        With 64 cycles in the Cache COG, we get maximum 1.25 MIPS. Can we do better than that?
        One way to do better would be to optimize the loop above, but I don't see it ... maybe someone else does.
        What about something like this (untested)? The only difference here is that filling the cache (fetch) has to be aware that cadr is now an offset relative to adrl.

        query           rdlong  cadr, addr wz           '  +0 =
                if_z    jmp     #$-1                    '  +8
                        sub     cadr, adrl              ' +12
                        cmp     cadr, adrh wc           ' +16 =
                        movd    response, cadr          ' +20
                        and     response, mask          ' +24
                if_nc   call    #fetch                  ' +28
        response        wrlong  0-0, addr               ' +32 =
                        jmp     #query                  ' +40
        
        fetch
        fetch_ret       ret
        
        mask            long    !(%111_000000 << 9)     ' limits dst to 0..$3F
        cadr            long    0
        addr            long    $7FFC
        adrl            long    0
        adrh            long    0
        
      • jazzed Posts: 11,803
        edited 2009-12-10 06:38
        kuroneko said...
        What about something like this (untested)? ...

        Nice. I'd have to go through it in the morning, but so far it looks like a sweet spot.
        Adding a "pad" for NOP (long 0) before response would make an XMM loop like this possible:
        {
        ------------------------------------------------------------------------------
        fetch long and execute              
        }
        nexti         wrlong    PC,     laddrp          ' 0,1   ... set l2cache address
                      add       PC,     #4              ' 2
                      nop                               ' 3
                      rdlong    inst,   ldatap wz       ' 4,5   ... NOP is padded by l2cache cog
                if_z  jmp       #$-1                    ' 6
        inst          nop                               ' 7
                      jmp       #nexti                  ' 8     ... 12*4 = 48 = 600ns (@80MHz) minimum              
        
        

      • Dr_Acula Posts: 5,484
        edited 2009-12-10 11:17
        This is certainly something to get the brain cells working. Thinking aloud with some practical examples, the zicog uses cog overlays but my understanding is the overlay then ends up sitting in hub ram and takes up space there. Eventually you run out of hub ram space.

        The hardware solutions (morpheus/triblade/dracblade) show it is quite possible to add hardware to increase memory. 512k is a common value. There have been long discussions on speed; essentially speed is proportional to pins, so a 28-pin solution using parallel ram is one extreme and a 2-pin I2C serial ram is at the other extreme.

        For me, hub code is not speed critical. Hub spin is already so much slower than pasm that slowing it down a little further is not a problem. What I'd like to be able to do is to keep writing spin code without ever having to worry about running out of space. Could this be done?

        First might be to require a different compiler. But BST has the nuts and bolts to do this. Next criteria might be to make the compiler as similar as possible to standard spin code (so rewriting it is not so daunting).

        I wonder (maybe this has been done before?) if you could set up overlays in spin? Set aside 10k of overlay space. Start writing your spin code. When it runs out of space for another subroutine, instead of jumping to it locally, the compiler grabs it from high memory, puts it in that top 10k, runs it, then exits?

        If all variables are kept local then shared global variables are not a problem (move a global to a local variable and pass it with the subroutine call?)

        Brainstorming further, a 'program' might be a self contained binary file - say it was 100k. The bottom bit fits into the propeller and contains code to call higher bits of itself and move them into that upper 10k (or whatever) as required. You would need a way of storing a 100k program as it won't fit in eeprom, but again that is already possible with hardware that has sd card access.

        Even cog overlays could be called from hub which are then called from upper memory.

        Maybe you want more control over this than allowing the compiler to do it, but perhaps you could write your standard spin code and for overlay code you add an #ifdef overlay so the compiler knows that sits in high memory.

        If this is too hard in spin, maybe it is possible in C or Basic? And it ought to be possible for different hardware if you define some standard procedures eg read_128_longs and write_128_longs and then only those need to be recoded for each hardware variant?

        This idea of caching is very interesting.

      • jazzed Posts: 11,803
        edited 2009-12-10 16:07
        Dr_Acula said...
        ...
        What I'd like to be able to do is to keep writing spin code without ever having to worry about running out of space. Could this be done?

        First might be to require a different compiler. But BST has the nuts and bolts to do this. Next criteria might be to make the compiler as similar as possible to standard spin code (so rewriting it is not so daunting).

        I wonder (maybe this has been done before?) if you could set up overlays in spin? Set aside 10k of overlay space. Start writing your spin code. When it runs out of space for another subroutine, instead of jumping to it locally, the compiler grabs it from high memory, puts it in that top 10k, runs it, then exits?

        If all variables are kept local then shared global variables are not a problem (move a global to a local variable and pass it with the subroutine call?)
        ....

        Yes it is possible. I started working on exactly this idea, which I call "BigSpin", but got distracted
        ... jazzed looks out the window and shouts: "hey you kids get outta my yard!" ...
        Now what was I doing? Oh yea, caching ....

        This falls along the lines of items 3 & 4, but not exactly. If you look at my spin debugger, SPUD,
        you will see how to steal time and get stack/object base info from the spin interpreter. Like this ....

        '-------------------------------------------------------------------------------
        '
        ' Send pcurr and dcurr to vmm, fetch opcode pcurr from vmm, and increment pcurr.
        ' Let vmm give us vmm pcurr if a vmm cache hit or old pcurr if no vmm hit.
        ' Pass dcurr to vmm to give method library index.       
        ' Pass pcurr to vmm and wait for command to be processed.
        ' Get pcurr back from vmm in modified or same state as passed (saves instructions).
        ' Vmm will take pcurr and analyze the instructions around it to see if we are using a library.
        ' If the vmm needs a library method from off chip, it will tell mass storage to fetch,
        ' and then place the code in a hub memory location for the interpreter to use.
        ' Once vmm detects a return, the pcurr is restored back to normal.
        '
        ' jazzed ... about propeller programming :)
        '
        fetch_pcurr
                wrword          dcurr,  #4              ' save stack pointer
                wrword          vbase,  #6              ' save variable base
                wrlong          pcurr,  vmmcmdadr       ' vmm will look for the "exec" code at pcurr.
        wvmm    rdlong          outb,   vmmcmdadr
                tjnz            outb,   #wvmm           ' wait for zero long
                rdlong          pcurr,  #4              ' vmm gives back pcurr in unused hub address 4.
        rdpcurr rdbyte          0-0,    pcurr           ' destination is set by calling routine            
                add             pcurr,  #1              ' increment pc
        fetch_pcurr_ret         ret
        
        vmmcmdadr               long    VMMADDR
        
        '
        '-------------------------------------------------------------------------------
        
        


        The fetch_pcurr routine is called near the spin Main Loop "loop" label.
        Other support for the VMM idea you have described is in the file SpudVMM.spin in the SPUD zip.
        http://forums.parallax.com/showthread.php?p=831329

        The VMM framework is there but not fully tested. I added a way to do a method lookup based on
        a stub method signature. Detecting the signature gives a "pointer" for which method body to load.

        '-------------------------------------------------------------------------------
        ' is the byte code sequence a vmm instruction?
        '
        vmminst        long $38,$38,$65,$32     ' use longs for comparison
        '
        
        


        Using BSTC listings makes it possible to manage a library method body table, so we know where
        to find the code and how big it is. The table and library code can live in fixed storage SDCARD,
        EEPROM, Flash, or external memory.

        A difficulty which you have addressed to a point is the object variables and dat references. For
        fixed storage, I would be concerned about wearing out the device ... no problem for external
        memory. One could restrict the methods artificially to use only local variables ... but that might
        be very unpopular ....

        The only real limit to what I've described is the number of methods Spin allows. Maybe that
        can be relaxed in the compiler? Is it a function of the interpreter? Not sure.

        So in summary, what you suggest is entirely possible. It's just a matter of someone implementing it :)
      • BradC Posts: 2,601
        edited 2009-12-10 23:51
        jazzed said...

        The only real limit to what I've described is the number of methods Spin allows. Maybe that
        can be relaxed in the compiler? Is it a function of the interpreter? Not sure.

        $FE spin methods per object in theory, although just the method table for that alone would consume ~1K.
        The first byte in the method table is the number of sub-objects.
        The second is the number of Spin methods (PUB+PRI) + 1 to account for the first method table entry
        The next word is the relative address of the end of the object (the object length effectively)

        So you are limited in the number of objects you can include (although the Parallax compiler limits that to 64 from memory) and the number of spin methods by the table entry size.
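
        A quick check of that ~1K figure, which implies four bytes per method-table entry:
        ' $FE = 254 method entries * 4 bytes each = 1016 bytes, i.e. ~1K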

      • jazzed Posts: 11,803
        edited 2009-12-11 20:23
        Well I seem to do marginally better with caching than with straight access.

        These routines do not include an L1 cache or any XMM access.

        ' An LMM loop
        {
        ------------------------------------------------------------------------------
        fetch long and execute ... for now just use L2 hub-cache ... add L1 cache later              
        }
        inst          nop                               ' execute instruction ....
        nexti         wrlong    PC,     laddrp          ' set l2cache address
                      xor       outa,   trig            ' free instruction
                      add       PC,     #4              ' increment PC
                      rdlong    l2chp,  ldatap          ' wait for nonzero hub pointer 
                      tjz       l2chp,  #$-1            ' wait if hub ptr 0
                      rdlong    inst,   l2chp           ' get instruction
                      jmp       #inst                   ' measured throughput ~1.6 MIPS (80MHz)
        
        ' The L2 COG cache loop
        {
        ------------------------------------------------------------------------------
        ' L2loop
        ' This is the main L2 Cache loop
        ' While it may take a while to deliver a cache hit long, the COG fetches in the background 
        }
        loadCache     rdlong    l2addr, l2addrp         ' reload since address is lost
                      call      #L2readHUB              ' loadCache - delivers first hub ptr via l2datap
                      'call      #L2readXMM              ' loadCache - delivers first hub ptr via l2datap
        L2done        wrlong    l2zero, l2addrp         ' write 0 to clear data signal
        L2loop        rdlong    l2addr, l2addrp wz      ' wait for user address
                if_z  jmp       #L2loop                 ' wait jump
                      cmpsub    l2addr, l2lcache wc  ' C set if l2addr >= l2lcache (and subtract)
                if_c  cmp       l2addr, l2size   wc  ' C set if offset lies within the cache
                if_c  add       l2addr, l2chp        ' hit: add cache hub pointer base address
                if_nc jmp       #loadCache           ' miss: below or beyond the cached window
        wrdata        wrlong    l2addr, l2datap         ' write hub pointer to user
                      jmp       #L2done                 ' cache hit 12*4 = 48 ticks = 600ns (@80MHz)
                      
        
        



        I have L1 + XMM fetch code that works in a single COG, but since it's all in one COG, I can't
        pass the response back to the LMM engine without "blocking" (finishing the entire cache fetch).

        I'm not sure at this point whether straight access like below is better than using an inter-cog L2
        cache or not. For now it seems like a tie with L2 caching, but the L2 caching doesn't read the XMM yet.

        {
        ------------------------------------------------------------------------------
        fetch long and execute ... use this COG to fetch.              
        }
        ' standard d07 a8...
        nexti         mov       outa,   PC
                      shl       outa,   #8
                      mov       phsa,   _phsa           ' counters start with low so we
                      mov       phsb,   _phsb           ' get 0011 and 0101 or 0123
                      movs      inst,   ina             ' xxxxxxxx_xxxxxxxx_xxxxxxxx_AAAAAAAA C=A0
                      ror       inst,   #8 wc           ' AAAAAAAA_xxxxxxxx_xxxxxxxx_xxxxxxxx C=A0
                      movs      inst,   ina             ' AAAAAAAA_xxxxxxxx_xxxxxxxx_BBBBBBBB
                      ror       inst,   #8              ' BBBBBBBB_AAAAAAAA_xxxxxxxx_xxxxxxxx
                      movs      inst,   ina             ' BBBBBBBB_AAAAAAAA_xxxxxxxx_CCCCCCCC
                      ror       inst,   #17             ' xDDDDDDD_DCCCCCCC_CBBBBBBB_BAAAAAAA
                      movi      inst,   ina             ' xDDDDDDD_DCCCCCCC_CBBBBBBB_BAAAAAAA
                      rcl       inst,   #1              ' DDDDDDDD_CCCCCCCC_BBBBBBBB_AAAAAAAA
                      add       PC,     #4
        inst          nop
                      jmp       #nexti                  ' 15 cog instructions per XMM fetch = 1.66 MIPS (100MHz), 1.33 MIPS (80MHz)
        
        
        
      • jazzed Posts: 11,803
        edited 2009-12-21 01:35
        Ok, I'm getting XLMM 1.42 MIPS cached at 80MHz (64 fetch & execute transactions in 45us).
        This compares with ~1.1 MIPS with direct fetch and execute XLMM. The cache fetches are
        burst, the direct fetches are one at a time. The actual direct XMM access time is 750ns but
        overhead makes it worse.

        Cache hit time at 80MHz is 600ns. Cache miss time is about 6us ... not great, but 8 longs get
        read in that time. Theoretical maximum rate is 1.6 MIPS in the current design. Actual data
        read rate on burst to HUB is 6.4MB/s. Cache is un-tagged and provides no write back ability.

        Functionally, the LMM engine is in one COG and the Cache engine takes up 5 COGs. Pretty
        sad that the data rate is so low "per COG". Currently, it takes 3 HUB access cycles to manage
        the Cache pointers. It takes 2 HUB access cycles for the main Cache reader to cause a long fetch.

        The main fetch/execute loop looks like below. The read32 routine for grabbing a long after FJMP,
        etc. is about the same.

        NextX         mov       cPc,    _PC             ' working "cache PC"
                      muxc      cflg,   #1              ' save user carry flag state
                      cmpsub    cPC,    l2cadr   wc     ' if D >= S C and D = D - S
                if_c  cmp       cPC,    l2size   wc     ' if D < S C
                if_nc jmp       #:loadCache             ' if D >= S load
                      add       cPc,    ldatap
                      rdlong    :instx, cPc             ' read data from cache
                      add       _PC,    #4              ' increment _PC
        '             or        outa,   trig            ' debug only
        '             andn      outa,   trig            ' debug only ... make sure address is 0
                      test      cflg,   #1 wc           ' restore user carry flag state
        :instx        nop                               ' execute instruction ....
                      jmp       #NextX                  ' 48 cycles
        
        :loadCache     
                      wrlong    _PC,    laddrp          ' set l2cache address to start burst read
                      mov       l2cadr, _PC
                      wrlong    zero,   laddrp          ' clear l2cache address right away
                      rdlong    data,   lrdyp           ' wait for cache ready
                      cmp       data,   waterline wc
                if_c  jmp       #$-2                    ' if not enough read, repeat
                      test      cflg,   #1 wc           ' restore user carry flag state
                      jmp       #NextX
        
        


        The "waterline" set by the cache engine allows the cache reader to operate in parallel with the
        execution loop. My conclusion at this point has become: "It's more of a PITA than it's worth,"
        unless you need it to boost performance on latched memory designs (possible, but a PITA).
        Using up so many cogs (among other things, like spending too much time on a fruitless
        endeavor) is a waste.
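
        For what it's worth, the cache-engine side of the waterline handshake only needs the reader to publish its progress as longs land in the hub buffer, something like this (my sketch; nread is a hypothetical counter):
        ' after each long of the burst is written to the hub buffer:
                      add       nread,  #1              ' count longs delivered so far
                      wrlong    nread,  lrdyp           ' publish progress for the LMM cog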

        Other things can be done to optimize, but as a friend once reminded me:
        "You can polish a turd for only so long before the **it gets all over your hands."

        So with that I'll stop messing with caching and get on to something more productive.
      • RossH Posts: 5,519
        edited 2009-12-21 02:51
        @jazzed,

        Interesting result. Thanks for the update.

        The first thing that occurs to me is that if people like us can make something work in 5 cogs, there's got to be a good chance that someone like Chip could make a version that works in only 3 cogs - and at that point it might begin to look worthwhile.

        Ross.

      • Cluso99 Posts: 18,069
        edited 2009-12-21 03:33
        Thanks for your results Steve (jazzed).

        It is pretty much as I expected, but well worth the exercise. You never know, someone may come back to this later and discover something missed, or better still, a new algorithm.

      • jazzed Posts: 11,803
        edited 2009-12-21 03:52
        RossH said...
        ... someone like Chip could make a version that works in only 3 cogs - and at that point it might begin to look worthwhile.
        @RossH,

        Yes. Bill Henning has 3 COG code for VGA bursts that might be usable for a cache. It uses a long string of accesses. The big challenge I faced though was getting enough data fast enough for a small "waterline" that would allow the LMM engine to work in parallel with the cache loader engine. I got the latency down to a 3us miss period so that the waterline could be set to 4 longs with some advantage for performance. Yes, there may be others who can solve the problem better. Even the last fetch/execute loop I posted could be optimized maybe.

        BTW: That last loop is the closest I could come to a generic interface for a kernel. I have another cache engine that can be used with it, but it is no faster than straight XMM ... which is ok if that's all you need. If you can make use of it, please do.

        I've included a Propalyzer screen shot of a cache miss reload and many hits for your amusement. The tick is set for 12.5ns, but it's really reading at 10ns so the numbers at the top are wrong. The whole screen is about 18us instead of the 22us shown. It shows getting a long every 12 instructions with the potential for double that rate since there is waste in the reader cog ... I won't bother with it too much though.

      • Dr_Acula Posts: 5,484
        edited 2009-12-21 03:57
        BigSpin? Yes, that is the idea. This is a real exercise in thinking laterally. It is a problem if it takes lots of cogs to run. Ok, that idea seems to have run into a dead end.

        Re sd cards 'wearing out', only writes wear them out, so you can have code on an sd card that you read many times into hub/cogs and that won't wear out an sd card. And/or you can read it once from an sd card into an external ram chip, then read from the ram chip into hub/cogs and that would be faster.

        The cog caching is already working. I don't understand *how* it works on the zicog, but I understand it enough to be able to replicate it: write some cache-type code, copy and paste a skeleton of code that goes at the beginning and end, add a reference to it somewhere (actually there are 3 references to add), and then put in the code. It is a bit complicated, but could possibly be automated via an IDE.

        Cache code for the cog is being stored in hub, so eventually one might run out of space in hub. But they are just bytes, so you could always bring them in from external ram rather than from hub ram. And the building blocks for that sort of code already exists.

        So I think caching for pasm already exists, and just needs to be automated/simplified.

        What intrigues me is the possibility of caching in spin. First thing would be spin code that is 'portable', ie it can reside anywhere in memory. For instance, in Z80 it is possible to write portable code by never using the JP instruction or the CALL instruction. The only jump instruction you can use is the Jump Relative instruction, which jumps back or forward by n bytes. Can you do such a thing in Spin?

        Or, you always compile the code for a fixed location somewhere near the top of hub memory. Each object is loaded to this location and run, it cannot be longer than n bytes and only one object can be there at any one time. But fast transfers between hub and external ram mean you can be loading bits of code in and out, and it could be very handy for the routines you only use infrequently - eg a bunch of Spin string manipulation routines.

        Still, many bits of code would fit these criteria.

        But my brain starts to hurt thinking of how to get code like this to work and how to keep things simple for the compiler. Maybe you have a routine called
        PUB Cache

        You pass 10 variables and you return 10 variables. That should be enough for most things.

        After PUB Cache is some code (probably a single call to a 'ram load' or 'sd load' that copies bytes to the cache location; that might be just a few bytes above this point, so the code then continues on at that point. You could be precise about that if you knew how long the bytefill code was, or maybe there are some NOPs so it doesn't overlap). You always use the same variable names as those that are passed (variable1, variable2 etc., or i,j,k for Fortran programmers). Maybe you pass a small array as well, or the location of a small array?

        So a question would be whether a compiler could handle the internal counters and flags and pointers for something like a 'repeat' in Spin that was always compiled to a fixed location?

      • jazzed Posts: 11,803
        edited 2009-12-21 04:53
        Dr A, one accessory COG is required for the BigSpin VM to handle method dispatch and fetch/update from the modified Spin interpreter, assuming that work is completed ... That leaves 6 cogs free for the user. Another COG could be added for XMM access to boost performance if one has the luxury, but it would not be necessary.

        I'm afraid Spin would not be very adequate for implementing VM stuff. No compiler changes would be required, but a compiler like BSTC that generates listings would be necessary for creating lookup information. As implementation goes, I've considered building a library and lookup table which, as you summarized, could go into XMM ... having the library loadable from any media without XMM can also be done, but would of course impact performance. A dynamic load library would be good to put in fast virtual memory. There are good models of this already.

        I'm not sure how you ever got the idea that reading an SD card would wear it out :) ... not from anything I said I hope. The caching I've explored and described here is for running 32 bit external LMM code faster than by normal one-at-a-time fetch/execute from XMM, with potential for application on address latched memory systems ... it has nothing to do with the Z80 emulation, which probably has such a feature implemented on RAM-DISC or whatever ... not that I care to know one way or another. You can keep that to yourself :)
      • Dr_Acula Posts: 5,484
        edited 2009-12-21 05:38
        Yes, there are so many possible models around it gets rather confusing to consider which one would work the best, which is easiest to code under, which is easiest to write a compiler for etc.

        Re wearing out sd cards, it would depend on the cache model. I think if you have a model where you bring in a group of bytes that represent an object, and you put them in hub and then execute it, then return a value and then throw away that code in hub, well that won't wear out the sd card obviously.

        But I'm pondering your initial post: "When exactly does it become useful or imperative to start using a swap partition on "fixed" media like SD card?"

        If you have a 'swap' partition, it depends on how you implement that. If at the 'block of code' level that you bring into hub, that would be fine. But if you are running your memory model at the 'long' size, or at the single-instruction size (which, I think, you are), then a piece of code in spin or pasm that says 'modify value x in memory' would modify the value in sd memory. (If that is the swap partition model you are looking at.)

        I think that was the path Windows went down. Early PCs had x amount of ram and then a swap file that sat on the hard drive and emulated more memory. Then you could add a memory chip and get a dramatic improvement. But I think there is a subtle difference here as I think a swap file will work on a physical hard drive but it won't work on a compact flash hard drive or an sd drive.

        The zicog dracblade board I'm using has both sd card and 512k of ram, so given a choice, I'd do the swap file in ram rather than on the sd. But you still need the sd to load the initial code (unless you use a bigger eeprom). I found this link that is quite interesting propeller.wikispaces.com/LMM+AiChip+Industries Such a concept would be quite easy to implement on a dracblade board (not using CP/M, just using spin and pasm) as you could move in blocks and execute them one instruction at a time. Interesting that they list certain instructions you can't use and have to replace with calls. (the calls and jumps, as would be expected).

        I'm still pondering the merits of models that implement one instruction at a time vs ones that move blocks in, then execute them, then move the next block in.

        Yours is a cache model that does things in groups of instructions rather than one at a time, right?

        How big do you make a 'group' of instructions?

        I suppose one model might be to make the size of a 'block' of memory the same as a cog. Consider a program that had hundreds of little subroutines/objects, that were all cog sized. To run one, you just start a cog and use the ram from hub as per the standard prop system. But what about when you run out of hub ram? How about a model where it boots up, moves 200 cog sized 'cogjects' from sd card into external 512k ram, then you run your program but whenever you start a cog and fill up its memory, you get that code from external 512k ram rather than from hub ram? Maybe this has already been done?

        Hmm, if I had to code within those constraints I might write some string routines, but maybe they don't all fit in one cog, so I'd put 10 in one 'cogject' and another 10 in another 'cogject' and compartmentalise the program that way. Then you might need some code that works out which 'cogject' to load. So maybe instead of coding with a long list of PUBs you instead have one PUB but pass it one of a long list of keywords? Certainly a slightly different style of programming. Is that getting close to the idea you had of a library and lookup table?

        Would that give you the flexibility to still code spin and pasm together?

      • jazzed Posts: 11,803
        edited 2009-12-21 06:08
        XLMM and BigSpin are two very different ideas ... try to keep them separate.

        On XLMM, one has 32 bit instructions that run on a machine similar to what you found in the LMM wiki. These instructions are normally stored in HUB RAM and are fetched and run one at a time by the LMM COG ... this is the basis for Catalina, ICC, and Bill's LAS (and Basic). The jump class of instructions has special needs in this model, but all the other PASM instructions can be executed straight up. Some advantages come from that, such as long constants and fast jumps effected by changing the program counter in addition to LMM jumps. This model is very easily moved into one where the code is executed straight from external memory at speeds just as fast as SPIN (SPIN is about a 1 MIPS machine, depending on the instruction).
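
        For readers new to LMM, the "special needs" of jumps are met with kernel primitives in the standard LMM idiom; a minimal sketch (reusing the nexti loop from the first post):
        ' compiled LMM code emits:    jmp  #LMM_JMP
        '                             long @target
        ' and the kernel loads the new PC from the long that follows:
        LMM_JMP       rdlong    PC, PC                  ' fetch the target address into PC
                      jmp       #nexti                  ' resume the fetch/execute loop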

        BigSpin would work on blocks of code rather than a long at a time. Each "library block" would need to be loaded per method into areas of memory based on an allocation mechanism, with statistics to keep the most used methods in HUB ram. The hard part of that model would be accounting for object variable offsets. At first, for proof of concept, methods having only local variables and results would be used. The Spin internal stub I showed in the first BigSpin post of this thread places all object and stack information into shared HUB memory for the method dispatcher to use with the loaded blocks. A library method could be run as a PASM block or, more conveniently, as an LMM block. One could even start another SPIN interpreter variant COG for running Spin library methods, but that would have to be hacked a special way.

        I haven't given the swap idea much thought and don't have comments on it.

        Added: Forgot to answer this question.
        Dr A said...
        Yours is a cache model that does things in groups of instructions rather than one at a time, right?
        The XLMM cache I was researching reads a block of memory (32 longs today), which is useful for hardware that uses latches, etc. Once the cache reader subsystem has read up to the "waterline" (before the entire block arrives), the LMM one-at-a-time execution engine starts running instructions. So the simple answer is: it pulls groups for one-at-a-time execution, and execution starts before the entire group is fetched.
