New PUSH and POP functionality added for single hub cycle LMM and other hub transfers
rogloh
Posts: 5,791
Hi all,
So I have been playing about with Quartus II and a DE0-Nano for a couple of days, and I think I have now figured out the P1V pipeline enough to be dangerous, so I've attempted some hopefully useful modifications to the P1V for my first project.
I have created PUSH and POP equivalent instructions for hub reads and writes, using the WC modifier flag of the existing PASM RDxxxx and WRxxxx instructions to indicate this extended behaviour. When WC is not present these PASM opcodes continue to behave just as they do today. Whenever WC is present, in addition to the actual hub memory operation taking place, the source register (used as the stack pointer) is modified (pre-decremented for pushes, post-incremented for pops) and written back to COG RAM to keep it pointing to the top of stack. The amount by which the stack pointer register changes reflects the size of the argument being pushed/popped to the hub memory stack (1, 2 or 4 bytes). The stack is conventional and grows downwards in hub memory. This allows easy mixing with COGs running SPIN, whose stacks all grow upwards, and it also allows very convenient use for LMM purposes. Finally, the WC modifier flag doesn't actually go and destroy the Carry flag; it just leaves it unchanged.
This all happens in the same number of clock cycles as the normal reads and writes. I was able to do this without extending the pipeline size by making use of the ALU in the normally idle wait states of the hub memory accesses, indicated by the m[4] state of the P1V code and writing back the modified S result in this period. There are at least 3 m[4] wait state cycles in a given hub memory access while waiting for the hub window to arrive and I only need to use the first one to do the increment/decrement operation on the S register and write it back. So in this way we can nicely get two register writebacks for the instruction, one for S used as the stack pointer and one for D to store the hub read result.
Some examples of the new POP D,SP operation:
    RDLONG D, SP WC   ' equivalent to POP LONG D or D = LONG[SP], SP+=4
    RDWORD D, SP WC   ' equivalent to POP WORD D or D = WORD[SP], SP+=2
    RDBYTE D, SP WC   ' equivalent to POP BYTE D or D = BYTE[SP], SP+=1
Some examples of the new PUSH D,SP operation:
    WRLONG D, SP WC   ' equivalent to PUSH LONG D or LONG[SP-4] = D, SP-=4
    WRWORD D, SP WC   ' equivalent to PUSH WORD D or WORD[SP-2] = D, SP-=2
    WRBYTE D, SP WC   ' equivalent to PUSH BYTE D or BYTE[SP-1] = D, SP-=1
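The push/pop semantics listed above can be modelled in a few lines of Python. This is only a behavioural sketch, not the Verilog change itself: the `HubStackModel` class, its register dictionary and the `rd`/`wr` method names are made up for illustration, addresses are assumed to be naturally aligned, and hub memory is treated as little-endian.

```python
import struct

SIZES = {"BYTE": 1, "WORD": 2, "LONG": 4}
FORMATS = {1: "<B", 2: "<H", 4: "<I"}   # little-endian hub layout (assumed)


class HubStackModel:
    """Hypothetical toy model of RDxxxx/WRxxxx with the WC push/pop extension."""

    def __init__(self, hub_size=64):
        self.hub = bytearray(hub_size)   # hub RAM
        self.reg = {}                    # named COG registers

    def rd(self, size, d, s, wc=False):
        """RDxxxx D, S [WC]: D = hub[S]; with WC, S is post-incremented (POP)."""
        n = SIZES[size]
        addr = self.reg[s]
        value = struct.unpack_from(FORMATS[n], self.hub, addr)[0]
        if wc:
            self.reg[s] = addr + n       # S is written back first
        self.reg[d] = value              # D is written last, with the hub data

    def wr(self, size, d, s, wc=False):
        """WRxxxx D, S [WC]: with WC, S is pre-decremented (PUSH); hub[S] = D."""
        n = SIZES[size]
        addr = self.reg[s]
        if wc:
            addr -= n
            self.reg[s] = addr
        mask = (1 << (8 * n)) - 1
        struct.pack_into(FORMATS[n], self.hub, addr, self.reg[d] & mask)


if __name__ == "__main__":
    cog = HubStackModel()
    cog.reg.update(SP=32, A=0x11223344)
    cog.wr("LONG", "A", "SP", wc=True)   # PUSH LONG A
    cog.rd("LONG", "B", "SP", wc=True)   # POP LONG B
    print(hex(cog.reg["B"]), cog.reg["SP"])
```

Pushing a long, a word and a byte and then popping them in the opposite order should leave SP where it started, which is a quick sanity check for the size-dependent adjustment.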
This extended feature should allow tight hub memory copies and finally true single hub cycle LMM loops! It may also be useful for other GCC C language models targeting the P1V whose stack is in HUB memory. As the stack pointer is just any COG register (not dedicated) you can also have multiple stacks if you ever wanted, or switch between them dynamically etc.
eg. a 5 MIP LMM loop @ 80MHz:
    instruction   NOP
                  RDLONG instruction, PC WC   ' WC flag is set so PC is autoincremented in this instruction
                  JMP #instruction
and a 5 Mega Long/s hub to COG RAM transfer @ 80 MHz:
    loop1         RDLONG dest, sourceAddr WC  ' WC set so read the long and set sourceAddr = sourceAddr+4
                  ADD loop1, _d0              ' increment destination in COG RAM
                  DJNZ count, #loop1          ' and loop
and COG to hub memory transfers can similarly be used (but in reverse order):
                  ADD destAddr, #4            ' we should start hub destination one long higher in memory, to compensate for predecrement
    loop2         WRLONG source, destAddr WC  ' WC set so write the long at destAddr = destAddr-4
                  SUB loop2, _d0              ' decrement source COG address
                  DJNZ count, #loop2          ' and loop
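To make the ordering of the two copy loops concrete, here is a rough Python model of them. It is a sketch under assumptions: hub memory is a list of longs indexed by byte address / 4, COG RAM is a list of 512 longs, `count` is at least 1, and the function names are invented for this example. The rate helper only restates the arithmetic behind the 5 M figure (one hub window every 16 clocks).

```python
def hub_window_rate(clock_hz=80_000_000, clocks_per_window=16):
    """Transfers per second if every loop iteration catches its hub window."""
    return clock_hz // clocks_per_window


def hub_to_cog(hub, cog, dest, source_addr, count):
    """Model of loop1: ascending hub reads, ascending COG destination."""
    while True:
        cog[dest] = hub[source_addr // 4]   # RDLONG dest, sourceAddr WC
        source_addr += 4                    # post-increment done by WC
        dest += 1                           # ADD loop1, _d0 (bumps the D field)
        count -= 1                          # DJNZ count, #loop1
        if count == 0:
            break
    return dest, source_addr


def cog_to_hub(hub, cog, source, dest_addr, count):
    """Model of loop2: descending hub writes, descending COG source."""
    dest_addr += 4                          # ADD destAddr, #4 (undo first predecrement)
    while True:
        dest_addr -= 4                      # pre-decrement done by WC
        hub[dest_addr // 4] = cog[source]   # WRLONG source, destAddr WC
        source -= 1                         # SUB loop2, _d0
        count -= 1                          # DJNZ count, #loop2
        if count == 0:
            break
    return source, dest_addr


if __name__ == "__main__":
    hub, cog = [0] * 32, [0] * 512
    cog[10:14] = [1, 2, 3, 4]
    # source starts at the LAST COG long, destAddr at the LAST hub long
    print(cog_to_hub(hub, cog, 13, 0x4C, 4), hub[16:20])
    print(hub_window_rate())
```

Note that in this model the COG to hub copy has to start from the top of both blocks, since the write pointer only ever moves downwards.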
I've been testing it out - I'm still checking out a few more corner cases so I might well find a bug somewhere but so far I am happy as this is looking like it is basically working out nicely. As this is my first experience ever with Verilog HDL I really don't know how (sub)optimal my code is or how badly it affects timing (total Quartus newbie so I have no idea yet on how to even figure that all out), but at least it does now appear to function as I wanted.
I've attached the original P1V files I modified for anyone else interested in this work. My changes are commented (look for PUSHPOP comments in the file where all my changes went), but to truly get how it works you'll need to understand the existing pipeline, COG RAM latency etc. That's where it all lives. If I (or others trying it out) find any bugs I will attempt to update this code.
Cheers!
Roger
Comments
Very useful.
Can also be used to load 32 bit constants in LMM:
LMM jump primitive becomes
LMM call primitive becomes
LMM return primitive becomes
My interest is improving bandwidth for digital cameras.
What happens in native PASM if the WRLONG is issued with WC? Is the Z flag also available?
If you have a Github account, can you fork my Altera or Master branch and make the change there, and post a pull request? Or, alternatively, would you mind if I merge it in there on your behalf?
Thanks!
===Jac
I have been wondering if it may be practical to reduce the hub access slots to every 8 clock cycles (instead of 16) by accepting a small drop in fMax.
Given how fast the block RAM on FPGAs tends to be, I suspect this could be feasible.
I realize this would not help LMM, but it would help hubexec, and by adding some form of a REP instruction, it would also help loading/saving a lot of longs to/from a cog. Food for thought.
Bill, brilliant observations!
These changes will more than double LMM's performance, while at the same time reducing hub usage slightly too.
I too have been wanting to double hub throughput by making single cycle hub access. But I would rather retain the 1:16 hub access and add 8 more cogs. I would like the VGA limited to only 2 cogs in this scenario. In 2 cogs without VGA, add 2 flexible UARTs; in another 2 cogs add 2 more basic counters. I forget what I thought for the other 2 cogs without VGA. The new 8 cogs would not have VGA, nor other additions.
At least 1 or 2 cogs would have 8KB cog RAM, but not sure which ones.
As you can see, I no longer believe all cogs must be identical. For way too long we have been expanding cog functions to every cog, even though we recognise only 1 or 2 cogs require these features.
There is some instruction space available in the hubop opcode space. Without getting carried away with new instructions...
ADDD D,#n
SUBD D,#n
where #n is 3 or 4 bits
The instruction at D will have its D field incremented/decremented by #n (typically by 1/2/4). This would limit the D field inc/dec to 9 bits with overflow truncated (instead of modifying the cccc bits by accident).
The WC modifier could also be used to inc/dec the S field by #n?
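As a sketch of what the proposed ADDD/SUBD would do to the instruction long, the Python below confines the arithmetic to the 9-bit D field (bits 17..9 in the P1 instruction layout) and compares it with a plain 32-bit add of `n << 9`, which can carry into the condition bits. The function names are hypothetical, and the 3 or 4 bit limit on #n is not enforced here.

```python
D_SHIFT = 9            # D field occupies bits 17..9 of a P1 instruction
FIELD_MASK = 0x1FF     # 9-bit field
LONG_MASK = 0xFFFFFFFF


def addd(instr, n):
    """Proposed ADDD: add n to the D field only, truncating overflow to 9 bits."""
    d = ((instr >> D_SHIFT) + n) & FIELD_MASK
    return (instr & ~(FIELD_MASK << D_SHIFT) & LONG_MASK) | (d << D_SHIFT)


def subd(instr, n):
    """Proposed SUBD: subtract n from the D field only."""
    return addd(instr, -n)


def plain_add(instr, n):
    """Today's idiom (ADD instr, _d0 style): overflow spills into bits 18 and up."""
    return (instr + (n << D_SHIFT)) & LONG_MASK


if __name__ == "__main__":
    # cccc = %1111, D = $1FF, S = $005, all other bits zero for clarity
    instr = (0xF << 18) | (0x1FF << D_SHIFT) | 0x005
    print(hex(addd(instr, 1)), hex(plain_add(instr, 1)))
```

With D at $1FF, the field-limited version wraps D to 0 and leaves the condition bits alone, while the plain add corrupts them.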
I have been trying to understand the m[4] state. You have obviously determined how it works, so a question...
The m[2] state outputs "px" on the ram address lines. Normally this is followed by
the m[3] state which immediately latches the "I" from the ram ( always @(posedge clk_cog) )
However, when
the m[2] state outputs "px" on the ram address lines, and is followed by m[4] wait state(s),
the m[4] state outputs "i(D)" on the ram address lines. So when wait ends, this is followed by
the m[3] state which immediately latches the "I" from the ram ( always @(posedge clk_cog) )
This means that the m[3] would incorrectly latch "i(D)" instead of "I" via "px" address.
What am I missing??? (because it works)
I know just what you mean. That same issue got me for a while while I was debugging my solution. It comes down to the fact that in m[4] the COG RAM is actually not enabled (see ram_ena), so it won't change its Q output from the last accessed address of px from back in m[2]. The COG RAM is registered on its outputs. I found I needed to draw out some timing diagrams on paper showing exactly when each bus changes and when data is latched to be able to decipher it.
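The effect described here (a RAM with registered outputs holding its last Q while it is not enabled) can be illustrated with a very small Python model. This is not the P1V Verilog and it ignores exact edge timing; the class name and the example addresses are assumptions, and only the names `px`, `ram_ena` and Q come from the discussion above.

```python
class RegisteredRam:
    """Toy synchronous RAM: the output register Q only updates when enabled."""

    def __init__(self, contents):
        self.mem = list(contents)
        self.q = 0

    def clock(self, addr, ram_ena):
        """One rising clock edge with the given address and enable."""
        if ram_ena:
            self.q = self.mem[addr]
        return self.q            # when disabled, Q keeps its previous value


if __name__ == "__main__":
    ram = RegisteredRam([0] * 16)
    px, d_addr = 5, 9            # hypothetical addresses
    ram.mem[px] = 0xA0BC1234     # the next instruction "I"
    ram.mem[d_addr] = 0x55       # whatever lives at i(D)

    ram.clock(px, ram_ena=True)              # m[2]: address px presented
    for _ in range(3):                       # m[4] waits: address is i(D), RAM disabled
        ram.clock(d_addr, ram_ena=False)
    print(hex(ram.q))                        # m[3] still sees the long fetched via px
```

Because the address lines are ignored while the enable is low, the value latched afterwards is still the one fetched via px.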
Good question... the datasheet implies C is unchanged after the RDxxxx/WRxxxx opcodes, however in testing on a real P1 HW developer board just now, I found it was actually passing carry through from the hub bus, still checking for more free cogs (C=0) by the looks of the Verilog code.
So even though I have slightly modified this original undocumented behavior of WC with my change, I highly doubt people were or will now be using these opcodes with WC just for that side effect of returning whether we have free cogs or not. Free COG status is still returned in the official HUBOP forms of COGINIT/COGSTOP etc which I didn't modify.
The WZ flag I didn't touch yet - WZ is still very useful for RDLONG polling some status location in hub memory within tight loops. However when WC is actually combined with WZ I was considering making a separate version using it to decode this and write the adjusted S pointer to a fixed LINK REGISTER in COG RAM (maybe at address 0 for example).
In this way we could have LMM code like this...
LMM JUMPS would just use WC
LMM CALLS would use both WC WZ, and then behave somewhat like JMPRET but with D register fixed at 0 (for example).
If this idea works along with what Bill showed above, we may never need to exit the LMM loop for anything. We'd have 32 bit loads, jumps and calls (leaf/non-leaf), all staying in the LMM loop, running as fast as possible.
Re/ CALL functionality
We need to preserve the WZ behaviour as it is very useful to be able to test the byte/word/long just read for zero, mailboxes depend on it, *BUT*
There is no reason why "RDLONG PC,PC WC" could not write to a pre-defined LR every time!
Actually I am compiling up the WZ WC combined form now for LR (turns out to be a trivial change) to test it.
I think it should also be preserved for push/pop variants (ie wc specified)
BTW I suggest $1EF or similar because sometimes we build jump tables starting at cog $000. This also was discussed in the P2 threads.
Here is my latest understanding of the states. Can you see any problems???
I located a problem in my pasm code but I still have some inconsistencies in my AUGxx instructions.
We have heaps of COG registers now free on an LMM VM using this single hub cycle model. Particularly now that we can avoid separate LMM code stubs for loading 32-bit constants into different registers and for branching, most of the COG could be kept free (or used for some type of cache purposes???). Our VM could in theory now have hundreds of registers instead of 1 to 2 dozen or so. Don't know how well GCC deals with large register spaces. May come in useful for argument passing, deep internal stack frames, and saving pushes/pops, but managing it all is best left to the compiler no doubt.
Hmm. You and Cluso have differing opinions there. I can see perhaps it could be useful to be able to simultaneously POP and set Z based on the popped result. It could save a cycle on result checking, but the big problem is that if we do any pop anywhere in a leaf function, it will then automatically clobber our LR, rendering it useless. We need to keep WC WZ reserved for POP and LR writing exclusively, and use WC only for general POPs.
However, when using WC to inc/dec the stack, I would have thought the WZ modifier could be used to designate the LR write.
BTW, doesn't this actually mean that the stack is NOT incremented/decremented under these circumstances???
Or is the stack pointer decremented by the leaf function (which can be a SUB SP,#1/2/4)???
For anyone keen to get these LR changes early, it's as simple as just doing this..
Replace this code in my last cog.v file
with this:
An example of its use is for calls
To return from this call directly if you haven't called any other functions, simple and fast.
If you have pushed the LR onto the stack at the beginning of your called function, you return using a pop of the LR, simple, just takes the one extra hub cycle.
Update: This code works fine when the source and destination are the same in the case of doing a CALL, but does not update the PC/SP when the source is not the same as the destination, only LR gets the updated value. This might be okay, or we may want a second write in the m[4] cycle so we get both updated. I doubt we need the LR write functionality when the destination is not the PC in the RDLONG. It isn't much use otherwise. So maybe we can trigger the LR write not just on all RDLONG D, S WC WZ but only when D=S and WC, WZ are both present, and behave normally otherwise. This could help preserve the WZ behavior Bill is interested in too for the regular pop operations.
The stack pointer/program counter is still incremented when you do a RDLONG xxx, yyy WC WZ. This is so it skips past the stored 32 bit long used as the call target making it perfect for the LR.
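A rough Python model of how jumps, calls and returns could chain together with the LR behaviour described above. It is a sketch under stated assumptions: the class and method names are invented, the LR write is modelled for the D=S form only (as suggested in the update), the leaf return is assumed to be a simple copy of LR into PC, and the PASM in the comments is my reading of the thread rather than the code that was posted. PC is taken to already point at the inline long when the RDLONG executes, because the LMM loop has incremented it past the fetched instruction.

```python
class LmmModel:
    """Hypothetical model of LMM jump/call/return using RDLONG/WRLONG with WC."""

    def __init__(self, hub_longs=64):
        self.hub = [0] * hub_longs    # hub RAM as longs, byte address // 4
        self.pc = 0
        self.sp = 0
        self.lr = 0

    def _read(self, addr):
        return self.hub[addr // 4]

    def _write(self, addr, value):
        self.hub[addr // 4] = value & 0xFFFFFFFF

    def jump(self):
        """RDLONG PC, PC WC: PC+4 is written first, then replaced by the target."""
        self.pc = self._read(self.pc)

    def call(self):
        """RDLONG PC, PC WC WZ (D=S): LR = address after the inline target long."""
        self.lr = self.pc + 4
        self.pc = self._read(self.pc)

    def leaf_return(self):
        """Assumed leaf return: copy LR back into PC (e.g. MOV PC, LR)."""
        self.pc = self.lr

    def push_lr(self):
        """WRLONG LR, SP WC: pre-decrement SP, then store LR."""
        self.sp -= 4
        self._write(self.sp, self.lr)

    def pop_return(self):
        """RDLONG PC, SP WC: load the saved LR into PC, post-increment SP."""
        self.pc = self._read(self.sp)
        self.sp += 4


if __name__ == "__main__":
    m = LmmModel()
    m.sp, m.pc = 0x80, 0x10
    m.hub[0x10 // 4] = 0x40           # inline call target
    m.call()
    print(hex(m.pc), hex(m.lr))
```

A non-leaf function would call `push_lr()` on entry and `pop_return()` on exit, which costs the one extra hub cycle mentioned above.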
As explained above I am keeping the WZ behavior the same when WC is not present. I know how it is used for mailbox polling, it's ideal for that.
But when looking at this, the m[0] cycle has the ram address set to "S" and it is latched at the end of the m[0] cycle, which is also the start of the m[1] cycle. The verilog says it is latched at posedge of clk_cog and m[2]. This caused me to rethink it and decide that the latching had to occur at the start of m[2] instead of the end of m[2].
Chip describes the states with m[0] being the S, m[1] being the D, m[2] being next I fetch and execute, with m[3] being write result. This made sense to me when looking at the latching occurring on the leading edge of the next state.
The conflict in this that I saw was the m states advance on the posedge of clk_cog. Therefore, I changed my mind and then discovered that the address on the ram was wrong.
So, do you have anything definitive that can help???
My only advice is don't believe all the comments in the Verilog regarding what exactly m[0], m[1], m[2] mean. I found that was very confusing and doesn't take into account lookup result latencies from the COG RAM. Best to draw it out with all the buses and map the states to what they do yourself. Remember all the setup is done at the start of the state in preparation for the next clock edge at which time everything updates and that the COG RAM is registered so there is an additional delay cycle from when you present the address to when you get the result.
RDLONG pc,pc wc nr ' would be LMM CALL that loads LR with incremented PC
RDLONG pc,pc wc ' would be LMM JMP, does not load LR
I don't think we need the 'NR' version of RDLONG's when WC=1 (or ever really, as condition codes can be used to stop it from executing)
This preserves the WZ attribute for all variations, and still allows selecting LR behaviour.
The assembler could be modified to make 'LR' be a synonym for 'NR'
I had thought about that too but it can't work as the NR bit already distinguishes between RDLONGs and WRLONGs. A RDLONG with NR is actually interpreted as a WRLONG in PASM today based on the instruction encodings. So right now I coded it up such that the extra LR loading step is triggered when D=S (with S not immediate) and WC WZ are both present.
eg.
RDLONG PC, PC WC, WZ
This form loads LR, but it would not modify Z. All other combinations of RDLONG/WRLONG etc involving WZ would still update Z as they do today. Now I cannot imagine a need to have the Z flag set in this case when we are jumping to a new PC read in from the hub code and it is a zero value, so it's probably the best way to go to get the best outcome with new LR functionality included.
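The decode rule described here can be written down as a small Python classifier over the P1 instruction fields (opcode in bits 31..26, then Z, C, R, I, the condition field, D and S). It is only an illustration of the stated conditions, not the actual Verilog decode; the `encode` and `classify` helpers and the returned labels are invented for this sketch.

```python
HUB_SIZES = {0b000000: "BYTE", 0b000001: "WORD", 0b000010: "LONG"}


def encode(opcode, z, c, r, i, d, s, cond=0b1111):
    """Pack fields into a 32-bit P1 instruction long."""
    return ((opcode << 26) | (z << 25) | (c << 24) | (r << 23) | (i << 22)
            | (cond << 18) | ((d & 0x1FF) << 9) | (s & 0x1FF))


def classify(instr):
    """Label a hub read/write according to the WC/WZ rules discussed above."""
    opcode = (instr >> 26) & 0x3F
    if opcode not in HUB_SIZES:
        return "other"
    z = (instr >> 25) & 1
    c = (instr >> 24) & 1
    r = (instr >> 23) & 1             # R (the NR bit cleared) selects read vs write
    i = (instr >> 22) & 1
    d = (instr >> 9) & 0x1FF
    s = instr & 0x1FF
    name = ("RD" if r else "WR") + HUB_SIZES[opcode]
    if not c:
        return name                   # no WC: behaves as it does today
    if name == "RDLONG" and z and not i and d == s:
        return "RDLONG+LR"            # D=S with WC WZ: LR is loaded, Z left alone
    return name + (" pop" if r else " push")


if __name__ == "__main__":
    pc = 0x1EF                        # hypothetical PC register address
    print(classify(encode(0b000010, 1, 1, 1, 0, pc, pc)))
    print(classify(encode(0b000010, 1, 1, 1, 0, 0x010, pc)))
```

Clearing the R bit of an encoded RDLONG yields a long that classifies as WRLONG, which is why NR could not be reused to select the LR behaviour.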
I forgot that NR distinguishes between reads & writes! Thanks guys.
Ok, you have convinced me, but only when D=PC (which is the only useful case for CALL/JMP anyway), so perhaps
$1EF: PC
$1EE: LR
The reason is WZ is very useful when scanning arrays or following lists
consider
rdlong myptr, listptr wc
where we load pointers from an array of pointers pointed to by listptr, and increment listptr on every load
This would be used in keyword lookups etc