New PUSH and POP functionality added for single hub cycle LMM and other hub transfers

rogloh · 2014-09-07 05:58

Hi all,

So I have been playing about with Quartus II and a DE-0 nano for a couple of days and think I have now figured out the P1V pipeline enough to be dangerous so I've attempted some hopefully useful modifications to the P1V for my first project.

I have created PUSH and POP equivalent instructions for hub reads and writes, using the WC modifier flag of the existing PASM RDxxxx and WRxxxx instructions to indicate this extended behaviour. When WC is not present these PASM opcodes continue to behave just as they do today. Whenever WC is present, in addition to the actual hub memory operation taking place, the source register (used as the stack pointer) is modified (pre-decremented for pushes, post-incremented for pops) and written back to COG RAM to keep it pointing to the top of stack. The amount the stack pointer register changes reflects the size of the argument being pushed/popped to the hub memory stack. The stack is conventional and grows downwards in hub memory. This allows easy mixing with COGs running SPIN, whose stacks all grow upwards, and it also allows very convenient use for LMM purposes. Finally, the WC modifier flag doesn't actually destroy the Carry flag; it just leaves it unchanged.

This all happens in the same number of clock cycles as the normal reads and writes. I was able to do this without extending the pipeline size by making use of the ALU in the normally idle wait states of the hub memory accesses, indicated by the m[4] state of the P1V code and writing back the modified S result in this period. There are at least 3 m[4] wait state cycles in a given hub memory access while waiting for the hub window to arrive and I only need to use the first one to do the increment/decrement operation on the S register and write it back. So in this way we can nicely get two register writebacks for the instruction, one for S used as the stack pointer and one for D to store the hub read result.

Some examples of the new POP D,SP operation:

RDLONG D, SP WC   ' equivalent to POP LONG D  or  D = LONG[SP], SP+=4
RDWORD D, SP WC   ' equivalent to POP WORD D  or  D = WORD[SP], SP+=2
RDBYTE D, SP WC   ' equivalent to POP BYTE D  or  D = BYTE[SP], SP+=1

Some examples of the new PUSH D,SP operation:

WRLONG D, SP WC   ' equivalent to PUSH LONG D  or  LONG[SP-4] = D, SP-=4
WRWORD D, SP WC   ' equivalent to PUSH WORD D  or  WORD[SP-2] = D, SP-=2
WRBYTE D, SP WC   ' equivalent to PUSH BYTE D  or  BYTE[SP-1] = D, SP-=1
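
For anyone who wants to check the pointer arithmetic without an FPGA, here is a rough software model of the push/pop behaviour listed above. It is only a sketch: `hub`, `cog`, `push` and `pop` are made-up names, addresses are assumed to be aligned and in range, and none of this comes from the P1V sources.

```python
# Hypothetical software model of the WC push/pop behaviour (not P1V code).
# `hub` is a bytearray standing in for hub RAM and `cog` a dict standing in
# for COG registers; both names are made up for this sketch.

SIZES = {"BYTE": 1, "WORD": 2, "LONG": 4}

def push(hub, cog, d, sp, size="LONG"):
    """Model of WRxxxx D, SP WC: pre-decrement SP, then store D at the new SP."""
    n = SIZES[size]
    cog[sp] = (cog[sp] - n) & 0xFFFFFFFF
    addr = cog[sp]
    value = cog[d] & ((1 << (8 * n)) - 1)      # keep only the low n bytes of D
    hub[addr:addr + n] = value.to_bytes(n, "little")

def pop(hub, cog, d, sp, size="LONG"):
    """Model of RDxxxx D, SP WC: read from SP, then post-increment SP."""
    n = SIZES[size]
    addr = cog[sp]
    value = int.from_bytes(hub[addr:addr + n], "little")
    cog[sp] = (cog[sp] + n) & 0xFFFFFFFF       # S writeback happens first...
    cog[d] = value                             # ...so the hub data wins if D == SP
```

A push followed by a pop of the same size leaves SP where it started, which is the property the LMM call/return sequences rely on.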

This extended feature should allow tight hub memory copies and finally true single hub cycle LMM loops! It may also be useful for other GCC C language models targeting the P1V whose stack is in HUB memory. As the stack pointer is just any COG register (not dedicated) you can also have multiple stacks if you ever wanted, or switch between them dynamically etc.

e.g. a 5 MIPS LMM loop @ 80 MHz:

instruction    NOP
               RDLONG instruction, PC WC   ' WC flag is set so PC is autoincremented in this instruction
               JMP    #instruction
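
The 5 MIPS figure follows from hub timing. As a rough check, assuming the usual P1 numbers (one hub window per cog every 16 system clocks, 4 clocks for an ordinary instruction, 8 clocks for a hub instruction that lands exactly on its window), the three-slot loop above fits one window exactly. The constant and function names below are made up for the sketch.

```python
# Back-of-envelope timing for the LMM loop above (all figures are assumptions
# stated in the lead-in, not measurements).
CLOCK_HZ = 80_000_000
HUB_WINDOW_CLOCKS = 16      # one hub slot per cog every 16 system clocks
PLAIN_CLOCKS = 4            # ordinary (non-hub) instruction
HUB_OP_MIN_CLOCKS = 8       # hub instruction that hits its window exactly

def lmm_loop_clocks():
    # fetched instruction slot + RDLONG instruction, PC WC + JMP #instruction
    return PLAIN_CLOCKS + HUB_OP_MIN_CLOCKS + PLAIN_CLOCKS

def lmm_instructions_per_second(clock_hz=CLOCK_HZ):
    loop = lmm_loop_clocks()
    windows = -(-loop // HUB_WINDOW_CLOCKS)   # round up to whole hub windows
    return clock_hz // (windows * HUB_WINDOW_CLOCKS)
```

With these numbers the loop takes 16 clocks, so every hub window is used and the rate is 80 MHz / 16 = 5 million LMM instructions per second.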

and a 5 Mega Long/s hub to COG RAM transfer @80 MHz:

loop1  RDLONG dest, sourceAddr WC    ' WC set so read the long and set sourceAddr = sourceAddr+4
       ADD  loop1, _d0               ' increment destination in COG RAM
       DJNZ count, #loop1            ' and loop

and COG to hub memory transfers can similarly be used (but in reverse order):

       ADD    destAddr, #4         ' we should start hub destination one long higher in memory, to compensate for predecrement
loop2  WRLONG source, destAddr WC  ' WC set so write the long at destAddr = destAddr-4
       SUB    loop2, _d0           ' decrement source COG address
       DJNZ   count, #loop2        ' and loop

I've been testing it out - I'm still checking out a few more corner cases so I might well find a bug somewhere but so far I am happy as this is looking like it is basically working out nicely. As this is my first experience ever with Verilog HDL I really don't know how (sub)optimal my code is or how badly it affects timing (total Quartus newbie so I have no idea yet on how to even figure that all out), but at least it does now appear to function as I wanted.

I've attached the original P1V files I modified for anyone else interested in this work. My change is also commented (look for PUSHPOP comments in the file where all my changes went) but to totally understand how it works you'll need to understand the existing pipeline and COG RAM latency etc to truly get it. That's where it all lives. If I (or others trying it out) find any bugs I will attempt to update this code.

Cheers!

Roger

David Betz · 2014-09-07 06:32

Very nice!

rogloh · 2014-09-07 06:39

Thanks, hoping it may come in useful for GCC. Especially when potentially tying it into executing code from the extended 32MB of hub SDRAM (one of my next plans for my DE-0 nano) which is also randomly accessible in a single hub cycle (at least for one COG).

ozpropdev · 2014-09-07 07:04

Cool rogloh!

Bill Henning · 2014-09-07 07:15

Nice!

Very useful.

Can also be used to load 32 bit constants in LMM:

   ' LMM code
   RDLONG dest, PC WC
   LONG const

LMM jump primitive becomes

   ' LMM code
   RDLONG PC, PC

LMM call primitive becomes

   ' LMM code
   JMP  #call
   long addr

...

call    rdlong tmp,pc  wc   ' get destination address, point pc to return address
        wrlong pc,sp  wc    ' push return address
        mov pc, tmp
        jmp  #lmm_loop

LMM return primitive becomes

   ' LMM code
   RDLONG PC, SP WC
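
To see how these primitives fit together, here is a toy model of an LMM interpreter built on the auto-incrementing forms above. It is only a sketch: the opcode names (LOADK, JUMP, CALL, RET, HALT), the dict-based memory and the `run_lmm` helper are all hypothetical and exist only to trace the PC and SP arithmetic.

```python
# Toy model of an LMM interpreter using the WC auto-increment forms.
# Opcode names and memory layout are invented for illustration only.

def run_lmm(code, pc=0, sp=0x8000, max_steps=100):
    """code maps a hub byte address to either an opcode tuple or a LONG value."""
    regs = {}
    stack = {}                       # hub longs written by pushes

    def fetch():                     # RDLONG x, PC WC : x = LONG[PC], PC += 4
        nonlocal pc
        value = code[pc]
        pc += 4
        return value

    for _ in range(max_steps):
        op = fetch()                 # the LMM loop's own instruction fetch
        kind = op[0]
        if kind == "LOADK":          # RDLONG dest, PC WC  followed by LONG const
            regs[op[1]] = fetch()
        elif kind == "JUMP":         # RDLONG PC, PC  followed by LONG target
            pc = code[pc]
        elif kind == "CALL":         # JMP #call  followed by LONG addr
            target = fetch()         # rdlong tmp, pc wc
            sp -= 4                  # wrlong pc, sp wc (pre-decrement)
            stack[sp] = pc
            pc = target              # mov pc, tmp
        elif kind == "RET":          # RDLONG PC, SP WC (post-increment)
            pc = stack[sp]
            sp += 4
        elif kind == "HALT":
            break
    return regs, sp
```

A small program that calls a routine, loads a constant there, returns and then jumps over some code exercises all four primitives and leaves SP back at its starting value.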

David Betz · 2014-09-07 07:28

Bill Henning wrote: »
Nice!

Very useful.

Can also be used to load 32 bit constants in LMM:
   ' LMM code
   RDLONG dest, PC WC
   LONG const
LMM jump primitive becomes
   ' LMM code
   RDLONG PC, PC
LMM call primitive becomes
   ' LMM code
   JMP  #call
   long addr

...

call    rdlong tmp,pc  wc   ' get destination address, point pc to return address
        wrlong pc,sp  wc    ' push return address
        mov pc, tmp
        jmp  #lmm_loop 
LMM return primitive becomes
   ' LMM code
   RDLONG PC, SP WC

That's a great observation.

rogloh · 2014-09-07 08:00

These are some good examples that Bill has provided of what else could now be possible. These approaches will now help keep the hub transfer bandwidth running at full utilization and the LMM code size fairly small and tidy. There's also not much wasted hub memory for all the calls/jumps/returns used, especially given we can start to make good use of the 32 bit addresses if we have more external memory fitted and mapped in above internal hub space (>64kB for example).

rjo__ · 2014-09-07 10:10

I'm on the road with only a kindle to play with. Last night I wrote a lengthy question about accessing the hub with fewer clocks using auto increment. My idea was to use USR0_3 to do it. My question got lost because I hit the wrong button on my kindle. Today I see that you have posted 95 percent of the solution.

My interest is improving bandwidth for digital cameras.

What happens in native PASM if the wrlong is issued with a wc? Is the z flag also available?

jac_goudsmit · 2014-09-07 10:59

Rogloh, I'm interested in integrating this feature into the Github repo (it may take a few days before I get to it, ideas are stacking up in my virtual "inbox").

If you have a Github account, can you fork my Altera or Master branch and make the change there, and post a pull request? Or, alternatively, would you mind if I merge it in there on your behalf?

Thanks!

===Jac

Bill Henning · 2014-09-07 11:00

Regarding hub bandwidth...

I have been wondering if it may be practical to reduce the hub access slots to every 8 clock cycles (instead of 16) by accepting a small drop in fMax.

Given how fast the block RAM on FPGAs tends to be, I suspect this could be feasible.

I realize this would not help LMM, but it would help hubexec, and by adding some form of a REP instruction, it would also help loading/saving a lot of longs to/from a cog. Food for thought.

Cluso99 · 2014-09-07 12:21

Absolutely fantastic rogloh!
Bill, brilliant observations!

These changes will more than double LMM's performance, while at the same time reducing hub usage slightly too.

I too have been wanting to double hub throughput by making single-cycle hub access. But I would rather retain the 1:16 hub access and add 8 more cogs. I would like the VGA limited to only 2 cogs in this scenario. In 2 cogs without vga, add 2 flexible uarts, in another 2 cogs add 2 more basic counters. I forget what I thought for the other 2 cogs without vga. The new 8 cogs would not have vga, nor other additions.
At least 1 or 2 cogs would have 8KB cog ram, but not sure which ones.

As you can see, I no longer believe all cogs must be identical. For way too long we have been expanding cog functions to every cog, even though we recognise only 1 or 2 cogs require these features.

There is some instruction space available in the hubop opcode space. Without getting carried away with new instructions...
ADDD D,#n
SUBD D,#n
where #n is 3 or 4 bits
the instruction at D will have its D field incremented/decremented by #n (typically by 1/2/4). This would limit the Dfield inc/dec to 9 bits with overflow truncated (instead of modifying the cccc bits by accident).
The WC modifier could be used to inc/dec the S field by #n also.
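
As a rough illustration of the ADDD/SUBD idea, here is a sketch of a D-field-only update. It assumes the standard P1 instruction layout with the D field in bits 17..9 and the condition bits in 21..18; `addd` and `subd` here are just hypothetical Python helpers named after the proposed instructions.

```python
# Sketch of the proposed ADDD/SUBD behaviour: only the 9-bit D field changes,
# and any overflow is truncated instead of leaking into the condition bits.
D_SHIFT = 9
D_FIELD = 0x1FF << D_SHIFT          # bits 17..9 of a P1 instruction

def addd(instr, n):
    d = ((instr >> D_SHIFT) + n) & 0x1FF           # wrap within 9 bits
    return (instr & ~D_FIELD & 0xFFFFFFFF) | (d << D_SHIFT)

def subd(instr, n):
    return addd(instr, -n)
```

Compare this with the usual `ADD loop1, _d0` trick: when the D field is already $1FF, a plain 32-bit add carries into bit 18 and changes the condition bits, while the field-limited version just wraps D to 0.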

Cluso99 · 2014-09-07 15:46

rogloh,
I have been trying to understand the m[4] state. You have obviously determined how it works, so a question...

The m[2] state outputs "px" on the ram address lines. Normally this is followed by
the m[3] state which immediately latches the "I" from the ram ( always @(posedge clk_cog) )

However, when
the m[2] state outputs "px" on the ram address lines, and is followed by m[4] wait state(s),
the m[4] state outputs "i(D)" on the ram address lines. So when wait ends, this is followed by
the m[3] state which immediately latches the "I" from the ram ( always @(posedge clk_cog) )

This means that the m[3] would incorrectly latch "i(D)" instead of "I" via "px" address.
What am I missing??? (because it works)

rogloh · 2014-09-07 16:31

Cluso99 wrote: »

rogloh,
I have been trying to understand the m[4] state. You have obviously determined how it works, so a question...

The m[2] state outputs "px" on the ram address lines. Normally this is followed by
the m[3] state which immediately latches the "I" from the ram ( always @(posedge clk_cog) )

However, when
the m[2] state outputs "px" on the ram address lines, and is followed by m[4] wait state(s),
the m[4] state outputs "i(D)" on the ram address lines. So when wait ends, this is followed by
the m[3] state which immediately latches the "I" from the ram ( always @(posedge clk_cog) )

This means that the m[3] would incorrectly latch "i(D)" instead of "I" via "px" address.
What am I missing??? (because it works)

I know just what you mean. That same issue got me for a while while I was debugging my solution. It comes down to the fact that in m[4] the COG RAM is actually not enabled (see ram_ena), so it won't change its Q output from the last accessed address of px from back in m[2]. The COG RAM is registered on its outputs. I found I needed to draw out some timing diagrams on paper showing exactly when each bus changes and when data is latched to be able to decipher it.
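
A tiny behavioural model may make this clearer. It is only a sketch of a RAM with a registered output and an enable, not the actual P1V memory; the `RegisteredRam` class and its `clock` method are made up for illustration.

```python
# Toy model of a synchronous RAM with a registered output and an enable.
# When `ena` is low on a clock edge, the output register keeps its old value
# no matter what is on the address lines.

class RegisteredRam:
    def __init__(self, contents):
        self.mem = list(contents)
        self.q = 0                      # registered output

    def clock(self, addr, ena):
        """Apply one rising clock edge and return the output register."""
        if ena:
            self.q = self.mem[addr]
        return self.q
```

Presenting the instruction address with the RAM enabled, then a different address with it disabled for any number of wait cycles, still leaves the instruction on the output, which matches the explanation above of why m[3] latches the right value.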

rogloh · 2014-09-07 16:34

jac_goudsmit wrote: »

Rogloh, I'm interested in integrating this feature into the Github repo (it may take a few days before I get to it, ideas are stacking up in my virtual "inbox").

If you have a Github account, can you fork my Altera or Master branch and make the change there, and post a pull request? Or, alternatively, would you mind if I merge it in there on your behalf?

Thanks!

===Jac

Hi Jac, I don't use github at the moment. You are certainly welcome to merge it in.

rogloh · 2014-09-07 18:04

rjo__ wrote: »

What happens in native PASM if the wrlong is issued with a wc ? Is the z flag also available ?

Good question... the datasheet implies C is unchanged after the RDxxxx/WRxxxx opcodes. However, in testing on a real P1 HW developer board just now, I found it was actually passing carry through from the hub bus, still checking for more free cogs (C=0) by the looks of the Verilog code.

So even though I have slightly modified this original undocumented behavior of WC with my change, I highly doubt people were or will now be using these opcodes with WC just for that side effect of returning whether we have free cogs or not. Free COG status is still returned in the official HUBOP forms of COGINIT/COGSTOP etc which I didn't modify.

The WZ flag I didn't touch yet - WZ is still very useful for RDLONG polling some status location in hub memory within tight loops. However when WC is actually combined with WZ I was considering making a separate version using it to decode this and write the adjusted S pointer to a fixed LINK REGISTER in COG RAM (maybe at address 0 for example).

In this way we could have LMM code like this...

LMM JUMPS would just use WC

           RDLONG PC, PC WC
           LONG JMPTARGET

LMM CALLS would use both WC WZ, and then behave somewhat like JMPRET but with D register fixed at 0 (for example).

           RDLONG PC, PC WC WZ  ' PC<=CALLTARGET,  LR=PC+4
           LONG CALLTARGET
           
           ... 

CALLTARGET
           WRLONG LR, SP WC  ' optionally PUSH LR if this is not a leaf function (ie. we will be calling other functions)

           ' ... FUNCTION CODE HERE

           RDLONG PC, SP WC  '  to return, optionally POP the return address from the stack if this is not a leaf function
                ' OR
           MOV PC, LR  ' RET    ' alternatively just copy our return address in LR back to the PC  (faster)

If this idea works along with what Bill showed above, we may never need to exit the LMM loop for anything. We'd have 32 bit loads, jumps and calls (leaf/non-leaf), all staying in the LMM loop, running as fast as possible.

Bill Henning · 2014-09-07 20:39

Thanks, I forgot to show the "LONG JMPTARGET" when describing using "RDLONG PC,PC WC" as the LMM jump... as my wife keeps reminding me, I am grey haired and old

Re/ CALL functionality

We need to preserve the WZ behaviour as it is very useful to be able to test the byte/word/long just read for zero, mailboxes depend on it, *BUT*

There is no reason why "RDLONG PC,PC WC" could not write to a pre-defined LR every time!

rogloh · 2014-09-07 20:53

Do you think that WZ behavior needs to be preserved also when you are pushing/popping with WC? I'm not sure, as you can still do the WZ without the WC to keep the existing behavior for mailbox polling. I doubt we'll be polling at the same time as changing the stack with WC. Anyway I won't be modifying WZ behavior there when WC is not passed.

Actually I am compiling up the WZ WC combined form now for LR (turns out to be a trivial change) to test it.

Bill Henning · 2014-09-07 21:04

Yes, I believe WZ behaviour must be preserved.

I think it should also be preserved for push/pop variants (ie wc specified)

rogloh wrote: »

Do you think that WZ behavior needs to be preserved also when you are pushing/popping with WC? I'm not sure, as you can still do the WZ without the WC to keep the existing behavior for mailbox polling. I doubt we'll be polling at the same time as changing the stack with WC. Anyway I won't be modifying WZ behavior there when WC is not passed.

Actually I am compiling up the WZ WC combined form now for LR (turns out to be a trivial change) to test it.

Cluso99 · 2014-09-07 21:05

rogloh wrote: »

Do you think that WZ behavior needs to be preserved also when you are pushing/popping with WC? I'm not sure, as you can still do the WZ without the WC to keep the existing behavior for mailbox polling. I doubt we'll be polling at the same time as changing the stack with WC. Anyway I won't be modifying WZ behavior there when WC is not passed.

Actually I am compiling up the WZ WC combined form now for LR (turns out to be a trivial change) to test it.

Yes, I think using the special case where WC is used, the WZ can be used to write to a fixed LR cog location. In this instance, would the incrementing/decrementing of the pointer need to be done??? If so, is there another spare m[4] state in the loop???
BTW I suggest $1EF or similar because sometimes we build jump tables starting at cog $000. This also was discussed in the P2 threads.

Cluso99 · 2014-09-07 21:19

Roger,
Here is my latest understanding of the states. Can you see any problems???
I located a problem in my pasm code but I still have some inconsistencies in my AUGxx instructions.

State Diagram:
                      
 ┴  ┼     └─┐-----------------------------------------------┳--------------------------------------------------------
    │       │  m[0]:  ┬                                     └─{ram<=alu_r}      ────Latch R (write result to Cog RAM)
    S     ┌─┘         │
    │     │           ┴  adr=sa|i(s)       ────Address S
    ┼     └─┐-----------------------------------------------┳--------------------------------------------------------
    │       │  m[1]:  ┬                                     └─sy<=ram           ────Latch S
    D     ┌─┘         │
    │     │           ┴  adr=da|i(d)       ────Address D
    ┼  ┬  └─┐-----------------------------------------------┳--------------------------------------------------------
    │  │    │  m[2]:  ┬                                     ┣─d<=ram            ────Latch D
    e  I  ┌─┘         │                                     └─s<=sx{sa/../s     ────Latch S2
    │  │  │           ┴  adr=px            ────Address I
    ┼  ┼  └─┐===============================================┳========================================================
    │  │    │  m[4]:  ┬                                     └─match<=...        ────Latch pin/cnt match (WAITxxx)?
   (w)(w) ┌─┘(option) │
    │  │  │           ┴  adr=da|i(d)       ────Address D
    ┼  ┼  └─┐===============================================┳========================================================
    │  │    │  m[3]:  ┬                                     ┣─run<=0/1          ────Latch run
    R  d  ┌─┘         │                                     ┣─pc<=px++          ────Latch pc++
    │  │  │           │                                     ┣─ix<=ram           ────Latch I (next instruction)
    │  │  │           │                                     ┣─sa<=ix/0          ────Latch sa =AUG(S)
    │  │  │           │                                     ┣─da<=ix/0          ────Latch da =AUG(D)
    │  │  │           │                                     ┣─cancel<=...       ────Latch cancel
    │  │  │           │                                     ┣─c<=alu_co         ────Latch c
    │  │  │           │                                     └─z<=alu_zo         ────Latch z
    │  │  │           │  adr=da|i(d)       ────Address D
    │  │  │           ┴  wrt=alu_wr        ────Write Strobe
    ┴  ┼  └─┐-----------------------------------------------┳--------------------------------------------------------
                                                                    
adr:  ram_a                                                         
ram:  ram_q                                                         
wrt:  ram_w

rogloh · 2014-09-07 21:27

Yeah I can put it anywhere, including $1EF, we should all come to some consensus on this. I had only indicated 0 for a couple of reasons: one is that you can easily remember that the "WZ" modifier means write to "zero" as the Link register, and also it might be set up for simplicity such that GCC keeps all its registers in low memory at COG RAM offsets mapping to their register numbers. R0=LR at 0, R1 at 1, R2 at 2 etc. Doesn't have to of course. Also Register 0 on VM's is sometimes useful for storing a constant 0 and not a link register. I do understand that keeping zero free can be useful for other purposes such as indexed tables.

We have heaps of COG registers now free on an LMM VM using this single hub cycle model. Particularly now that we can avoid separate LMM code stubs for loading 32 bit constants into different registers and branching, most of the COG could be kept free (or used for some type of cache purposes???). Our VM could in theory now have hundreds of registers instead of 1 to 2 dozen or so. Don't know how well GCC deals with large register spaces. May come in useful for argument passing, deep internal stack frames, and saving pushes/pops, but managing it all is best left to the compiler no doubt.

rogloh · 2014-09-07 21:36

Bill Henning wrote: »

I think it should also be preserved for push/pop variants (ie wc specified)

Hmm. You and Cluso have differing opinions there. I can see perhaps it could be useful to be able to simultaneously POP and set Z based on the popped result. It could save a cycle on result checking, but the big problem is that if we do any pop anywhere in a leaf function, it will then automatically clobber our LR, rendering it useless. We need to keep WC WZ reserved for POP and LR writing exclusively, and use WC only for general POPs.

rogloh · 2014-09-07 21:49

Cluso99 wrote: »

Roger,
Here is my latest understanding of the states. Can you see any problems???
I located a problem in my pasm code but I still have some inconsistencies in my AUGxx instructions.

I think you might want to show when exactly the data bus outputs from COG RAM are latched into registers and link them back to edges. For example you do list in m[2] that D is latched, but it is actually latched right at the end of m[2] just before m[4] or m[3]. There is one cycle of latency from latching the COG RAM address presented earlier to latching back the returned data on ram_q. If you draw a couple more lines to the clock boundaries as to when these buses change and their intermediate results it might be clearer. That's what I did to get my head around it when I wanted to change it and when it wasn't behaving as I expected. You also have sy being latched on the boundary of m[0] to m[1], but it is latched one clock cycle later, at the clock occurring at the end of m[1] for example.

Cluso99 · 2014-09-07 23:23

rogloh wrote: »

Hmm. You and Cluso have differing opinions there. I can see perhaps it could be useful to be able to simultaneously POP and set Z based on the popped result. It could save a cycle on result checking, but the big problem is that if we do any pop anywhere in a leaf function, it will then automatically clobber our LR, rendering it useless. We need to keep WC WZ reserved for POP and LR writing exclusively, and use WC only for general POPs.

The RDxxxx WZ absolutely needs to be preserved. This is used extensively to wait on hub flags.

However, when using WC to inc/dec the stack, I would have thought the WZ modifier could be used to designate the LR write.

BTW, doesn't this actually mean that the stack is NOT incremented/decremented under these circumstances???
Or is the stack pointer decremented by the leaf function (which can be a SUB SP,#1/2/4)???

rogloh · 2014-09-07 23:33

I have tested out some of the Link Register functionality I proposed above using register address $1EF as the LR. It seems to be working. I will do a few more tests then update the zip file in the first post (or better yet add another separate zipfile with this variant of my earlier work).

For anyone keen to get these LR changes early, it's as simple as just doing this..

Replace this code in my last cog.v file

wire [8:0] ram_a = m[2] && !pushpop || m[4] && pushpop && !firstwaitcycle ? px
                 : m[0] || m[4] && firstwaitcycle ? i[sh:sl]
                 : i[dh:dl];

with this:

wire [8:0] ram_a = m[2] && !pushpop || m[4] && pushpop && !firstwaitcycle ? px
                 : m[0] || m[4] && firstwaitcycle && !(i[wz] && i[wr]) ? i[sh:sl]
                 : m[4] && firstwaitcycle && i[wz] && i[wr] ? 9'h1ef // write to "LR" register at $1EF if WZ and WC are set on POPS
                 : i[dh:dl];
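
If the nested conditional is hard to read, here is a plain Python transliteration of the same priority order. It is only a sketch for reasoning about which address wins in each state: the function name and arguments are made up, `m` is passed as a set of active state numbers, and the signals `pushpop` and `firstwaitcycle` are simply taken as inputs since their definitions live elsewhere in cog.v.

```python
# Transliteration of the modified ram_a expression (Verilog: && binds tighter
# than ||, and ?: binds loosest, so each line below is one ?: condition).

def ram_a(m, pushpop, firstwaitcycle, wz, wr, px, s_field, d_field, lr=0x1EF):
    if (2 in m and not pushpop) or (4 in m and pushpop and not firstwaitcycle):
        return px                       # instruction fetch address
    if 0 in m or (4 in m and firstwaitcycle and not (wz and wr)):
        return s_field                  # S register (read, or pointer writeback)
    if 4 in m and firstwaitcycle and wz and wr:
        return lr                       # adjusted pointer goes to LR instead
    return d_field                      # D register
```

One thing this makes visible is that for a push/pop instruction the fetch address px is presented during the later m[4] wait cycles rather than in m[2].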

An example of its use is for calls

RDLONG PC, PC  WC WZ   ' PC<=LONG[PC] = CALLTARGET,  LR = PC+4 (ie. the address following LONG storage of CALLTARGET) ,  LR is fixed at address $1ef in COG RAM
LONG CALLTARGET

To return from this call directly if you haven't called any other functions, simple and fast.

MOV PC, $1EF

If you have pushed the LR onto the stack at the beginning of your called function, you return using a pop of the LR, simple, just takes the one extra hub cycle.

RDLONG PC, SP WC

Update: This code works fine when the source and destination are the same in the case of doing a CALL, but does not update the PC/SP when the source is not the same as the destination, only LR gets the updated value. This might be okay, or we may want a second write in the m[4] cycle so we get both updated. I doubt we need the LR write functionality when the destination is not the PC in the RDLONG. It isn't much use otherwise. So maybe we can trigger the LR write not just on all RDLONG D, S WC WZ but only when D=S and WC, WZ are both present, and behave normally otherwise. This could help preserve the WZ behavior Bill is interested in too for the regular pop operations.

rogloh · 2014-09-07 23:34

Cluso99 wrote: »

The RDxxxx WZ absolutely needs to be preserved. This is used extensively to wait on hub flags.

However, when using WC to inc/dec the stack, I would have thought the WZ modifier could be used to designate the LR write.

BTW, doesn't this actually mean that the stack is NOT incremented/decremented under these circumstances???
Or is the stack pointer decremented by the leaf function (which can be a SUB SP,#1/2/4)???

The stack pointer/program counter is still incremented when you do a RDLONG xxx, yyy WC WZ. This is so it skips past the stored 32 bit long used as the call target making it perfect for the LR.

As explained above I am keeping the WZ behavior the same when WC is not present. I know how it is used for mailbox polling, it's ideal for that.

Cluso99 · 2014-09-07 23:39

rogloh wrote: »

I think you might want to show when exactly the data bus outputs from COG RAM are latched into registers and link them back to edges. For example you do list in m[2] that D is latched, but it is actually latched right at the end of m[2] just before m[4] or m[3]. There is one cycle of latency from latching the COG RAM address presented earlier to latching back the returned data on ram_q. If you draw a couple more lines to the clock boundaries as to when these buses change and their intermediate results it might be clearer. That's what I did to get my head around it when I wanted to change it and when it wasn't behaving as I expected. You also have sy being latched on the boundary of m[0] to m[1], but it is latched one clock cycle later, at the clock occurring at the end of m[1] for example.

Originally I thought sy was latched at the end of m[1].
But when looking at this, the m[0] cycle has the ram address set to "S" and it is latched at the end of the m[0] cycle, which is also the start of the m[1] cycle. The verilog says it is latched at posedge of clk_cog and m[2]. This caused me to rethink it and decide that the latching had to occur at the start of m[2] instead of the end of m[2].
Chip describes the states with m[0] being the S, m[1] being the D, m[2] being next I fetch and execute, with m[3] being write result. This made sense to me when looking at the latching occurring on the leading edge of the next state.
The conflict in this that I saw was the m states advance on the posedge of clk_cog. Therefore, I changed my mind and then discovered that the address on the ram was wrong.

So, do you have anything definitive that can help???

rogloh · 2014-09-07 23:50

Cluso99 wrote: »

Originally I thought sy was latched at the end of m[1].
But when looking at this, the m[0] cycle has the ram address set to "S" and it is latched at the end of the m[0] cycle, which is also the start of the m[1] cycle. The verilog says it is latched at posedge of clk_cog and m[2]. This caused me to rethink it and decide that the latching had to occur at the start of m[2] instead of the end of m[2].
Chip describes the states with m[0] being the S, m[1] being the D, m[2] being next I fetch and execute, with m[3] being write result. This made sense to me when looking at the latching occurring on the leading edge of the next state.
The conflict in this that I saw was the m states advance on the posedge of clk_cog. Therefore, I changed my mind and then discovered that the address on the ram was wrong.

So, do you have anything definitive that can help???

My only advice is don't believe all the comments in the Verilog regarding what exactly m[0], m[1], m[2] mean. I found that was very confusing and doesn't take into account lookup result latencies from the COG RAM. Best to draw it out with all the buses and map the states to what they do yourself. Remember all the setup is done at the start of the state in preparation for the next clock edge at which time everything updates and that the COG RAM is registered so there is an additional delay cycle from when you present the address to when you get the result.

Bill Henning · 2014-09-08 06:38

How about using "not write result" to indicate load LR?

RDLONG pc,pc wc nr ' would be LMM CALL that loads LR with incremented PC

RDLONG pc,pc wc ' would be LMM JMP, does not load LR

I don't think we need the 'NR' version of RDLONG's when WC=1 (or ever really, as condition codes can be used to stop it from executing)

This preserves the WZ attribute for all variations, and still allows selecting LR behaviour.

The assembler could be modified to make 'LR' be a synonym for 'NR'

rogloh · 2014-09-08 07:21

Bill Henning wrote: »

How about using "not write result" to indicate load LR?

RDLONG pc,pc wc nr ' would be LMM CALL that loads LR with incremented PC

RDLONG pc,pc wc ' would be LMM JMP, does not load LR

I don't think we need the 'NR' version of RDLONG's when WC=1 (or ever really, as condition codes can be used to stop it from executing)

This preserves the WZ attribute for all variations, and still allows selecting LR behaviour.

The assembler could be modified to make 'LR' be a synonym for 'NR'

I had thought about that too but it can't work as the NR bit already distinguishes between RDLONGs and WRLONGs. A RDLONG with NR is actually interpreted as a WRLONG in PASM today based on the instruction encodings. So right now I coded it up such that the extra LR loading step is triggered when D=S (with S not immediate) and WC WZ are both present.
eg.

RDLONG PC, PC WC, WZ

This form loads LR, but it would not modify Z. All other combinations of RDLONG/WRLONG etc involving WZ would still update Z as they do today. Now I cannot imagine a need to have the Z flag set in this case when we are jumping to a new PC read in from the hub code and it is a zero value, so it's probably the best way to go to get the best outcome with new LR functionality included.
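
To make the encoding point concrete, here is a small sketch of how the hub read/write opcodes are laid out in a P1 instruction word (opcode in bits 31..26, then Z, C, R, I, the condition in 21..18, D in 17..9 and S in 8..0). The helper names are made up, and the LR trigger test only mirrors the rule described above (a read with D=S, S not immediate, WC and WZ both set); it is not taken from the Verilog.

```python
# RDxxxx and WRxxxx share an opcode and differ only in the R (write result) bit,
# which is why NR cannot be reused as an "LR" marker.
OPCODES = {"BYTE": 0b000000, "WORD": 0b000001, "LONG": 0b000010}

def encode_hub_op(size, d, s, read, wz=False, wc=False, imm=False, cond=0b1111):
    return ((OPCODES[size] << 26) | (int(wz) << 25) | (int(wc) << 24)
            | (int(read) << 23) | (int(imm) << 22) | (cond << 18)
            | ((d & 0x1FF) << 9) | (s & 0x1FF))

def loads_link_register(instr):
    """Assumed trigger rule: hub read, WC and WZ set, S not immediate, D == S."""
    is_hub_rw = (instr >> 26) in OPCODES.values()
    wz, wc = (instr >> 25) & 1, (instr >> 24) & 1
    read, imm = (instr >> 23) & 1, (instr >> 22) & 1
    d, s = (instr >> 9) & 0x1FF, instr & 0x1FF
    return bool(is_hub_rw and read and wc and wz and not imm and d == s)
```

Encoding the same operands as a read and as a write gives two words that differ only in bit 23, so "RDLONG ... NR" and WRLONG are literally the same instruction.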

Bill Henning · 2014-09-08 07:46

Yep, I am definitely getting old.

I forgot that NR distinguishes between reads & writes! Thanks guys.

Ok, you have convinced me

but only when D=PC (which is the only useful case for CALL/JMP anyway), so perhaps

$1EF: PC
$1EE: LR

The reason is WZ is very useful when scanning arrays or following lists

consider

rdlong myptr, listptr wc

where we load pointers from an array of pointers pointed to by listptr, and increment listptr on every load

This would be used in keyword lookups etc
