NEW enhanced LDPTR/STPTR operations for direct COGRAM stack access

rogloh · 2014-09-23 08:02

In a separate post, I described some new LOAD/STORE instructions that would be both fast and useful for reading from COGRAM indirectly, for example during pointer dereference operations when running assembled C code.

Following on from that I decided to incorporate another new feature to allow direct access to data on the COGRAM stack (both reading and writing) in a single cycle. It was a bit trickier to do but I found a way to reuse the ALU in the M0 cycle to add the current stack pointer value to an immediate constant for computing the COG RAM address to be accessed in the remainder of the instruction cycle. It gets tricky as it also involves dynamic reuse of D/S registers in the middle of the instruction for passing this data around and accessing the ALU.

A summary of the two new instructions is this:

LDPTR D,#S  '  D register gets the COGRAM data at address SP+#S    or   D<=*(SP+#S)
STPTR D,#S  '  D register data written to COGRAM at address SP+#S  or   *(SP+#S)<=D

What this now allows is COGRAM<->register transfers to/from any data variables maintained on the stack in a single instruction cycle. In the case of running assembled C code this gives direct access to the currently executing function's local variables and any function arguments passed on the stack (if not already passed in via registers). The compiler will know the offsets of the local variables relative to the base of the stack frame and can use this to set up the immediate #S value accordingly in the PASM instruction. The 9-bit immediate range allows up to 512 longs' worth of local data in a function, counted from the base of the stack, to be accessed in a single instruction. Anything further away than that will not be accessible in a single cycle but can still be accessed using an extended method (see later).
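
For anyone who reads C more easily than PASM, here is a minimal sketch of the two operations as a software model. Only the D<=*(SP+#S) and *(SP+#S)<=D behaviour comes from the description above; the Cog struct, the COGRAM_LONGS size and the function names are made up for illustration and do not correspond to anything in the Verilog.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed COGRAM size in longs; hypothetical, chosen only for this model. */
#define COGRAM_LONGS 2048u

typedef struct {
    uint32_t ram[COGRAM_LONGS]; /* COGRAM, addressed in longs */
    uint32_t sp;                /* stack pointer (a long address) */
} Cog;

/* LDPTR D,#S : returns the value that would land in D, i.e. *(SP+#S). */
uint32_t ldptr(const Cog *cog, uint32_t s)
{
    assert(s < 512u);                    /* #S is a 9-bit immediate */
    assert(cog->sp + s < COGRAM_LONGS);  /* stay inside the modelled RAM */
    return cog->ram[cog->sp + s];
}

/* STPTR D,#S : writes the value of D to *(SP+#S). */
void stptr(Cog *cog, uint32_t d, uint32_t s)
{
    assert(s < 512u);
    assert(cog->sp + s < COGRAM_LONGS);
    cog->ram[cog->sp + s] = d;
}
```

With sp set to 100, stptr(&cog, 42, 3) writes long 103 and ldptr(&cog, 3) reads it back.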

Eg. Take some simple C code doing this nonsense:

int adder(int v1, int v2)
{
	int x=3;

	x = v1+v2;
	return x;
}

This can now be compiled fairly easily into this (unoptimized) PASM function equivalent below.

adder
SUB   sp,#1   ' make room for local variable x on the stack
MOV   r1,#3   ' we are loading a small constant so no need for AUGS
STPTR r1,#0   ' write this constant to local variable x on the stack at offset #0
LDPTR r1,#3   ' addresses and reads v1 argument from stack 
LDPTR r2,#2   ' addresses and reads v2 argument from stack (#1 gets skipped as it holds return PC pushed after the arguments)
ADD   r1,r2   ' add v1 and v2
STPTR r1,#0   ' write sum to x
ADD   sp,#1   ' cleanup locals
RETX  #2      ' return and skip two input arguments

Note that here I am assuming the result is returned in r1 but arguments are still passed on the stack (I know it's not necessarily realistic; it's just to show the operations possible now).
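
To check the stack offsets used in the PASM listing, the sequence can be traced in a small C model. This is a sketch under assumptions: the stack is taken to grow downward with a pre-decrement push (inferred from the SUB sp,#1 and the offsets in the comments), RETX #2 is assumed to pop the return PC and then skip two longs, and push/ldptr/stptr/call_adder are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

#define COGRAM_LONGS 2048u  /* assumed size, for the model only */

typedef struct { uint32_t ram[COGRAM_LONGS]; uint32_t sp; } Cog;

/* Assumed push: pre-decrement SP, then store. */
void push(Cog *c, uint32_t v) { assert(c->sp > 0u); c->ram[--c->sp] = v; }

uint32_t ldptr(const Cog *c, uint32_t s)
{
    assert(c->sp + s < COGRAM_LONGS);
    return c->ram[c->sp + s];
}

void stptr(Cog *c, uint32_t d, uint32_t s)
{
    assert(c->sp + s < COGRAM_LONGS);
    c->ram[c->sp + s] = d;
}

/* Caller side plus the body of adder; the result comes back in r1. */
uint32_t call_adder(Cog *c, uint32_t v1, uint32_t v2)
{
    uint32_t r1, r2;
    push(c, v1);          /* caller pushes first argument        */
    push(c, v2);          /* caller pushes second argument       */
    push(c, 0x123u);      /* CALLX: placeholder return PC        */

    c->sp -= 1u;          /* SUB   sp,#1  room for x             */
    r1 = 3u;              /* MOV   r1,#3                         */
    stptr(c, r1, 0u);     /* STPTR r1,#0  x = 3                  */
    r1 = ldptr(c, 3u);    /* LDPTR r1,#3  v1                     */
    r2 = ldptr(c, 2u);    /* LDPTR r2,#2  v2 (#1 is return PC)   */
    r1 += r2;             /* ADD   r1,r2                         */
    stptr(c, r1, 0u);     /* STPTR r1,#0  x = v1 + v2            */
    c->sp += 1u;          /* ADD   sp,#1  drop locals            */
    c->sp += 1u + 2u;     /* RETX  #2     pop PC, skip two args  */
    return r1;
}
```

Under these assumptions the stack pointer ends up where it started, and x sits one long below the return PC while the function runs.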

To make room for this feature I had to redefine my original LOAD D,#S instruction. It turns out that this immediate form of LOAD D,#S is not particularly useful because it can be implemented equivalently by just doing a simple MOV D,S. Therefore we won't need to keep this LOAD D,#S around anymore. So it has been replaced by the two instructions above, which use the WZ flag modifier to differentiate between LDPTR and STPTR. Accordingly we can't write the Z flag with these two instructions anymore, unlike the other LOAD/STORE operations which still currently retain it. A slight annoyance but I believe I can live with that when reading from the stack, and we can just use a following instruction to set the Z flag whenever that zero test is required.

Now by default it uses the stack pointer to compute the COGRAM address so in this case we just access stack variables only. But my plan is to enhance this further very soon by providing us with a future AUGPTR type of instruction to compute a base address to be used instead of the stack pointer. This base address would just be used once in the LDPTR/STPTR instruction that immediately follows it, similar to how Cluso's AUGDS/AUGS works for example. This way we can share the functionality of LDPTR/STPTR between stack access and general pointer access and we don't need to burn another instruction opcode for that variant.

So we could then do something like this:

AUGPTR Base, IndexOffset  ' Computes new pointer base for next instruction by adding two arguments, note that NR is implicit so Base does not need to get trashed 
LDPTR D, #Element_Offset  ' Adds pointer base to #Element_Offset then dereferences the pointer and writes the looked up value in COGRAM to the D register

which is equivalent to D <= COGRAM[Base + IndexOffset + #Element_Offset]

IndexOffset could have been computed based on the size of each element and some array index. We get to add two registers and a constant to form the address.

An AUGPTR & STPTR pair works the same way but writes data from register address D to this resulting address in COGRAM.

Examples of C code that could readily make use of this enhanced pointer capability are commonly done things such as:

x = *(p+3);
x = p->element;
x = array[200].element;
x = struct.element;

This combination of PASM code is now very powerful as you can use it to compute lookup addresses and access elements of structure arrays quickly and flexibly. Eg, You can compute an index offset based on the sizeof the array element and index, add it to the base start address of the variable and also add a fixed offset to choose an element of the structure, all in very few instructions.
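
As a rough illustration of the proposed AUGPTR + LDPTR pairing, here is a C model of the address calculation for something like x = array[200].element. AUGPTR is not implemented yet, so this is only a sketch of the intended behaviour: the one-shot override flag, the struct layout (3 longs per element, field at offset 1) and all names are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define COGRAM_LONGS 2048u  /* assumed size, for the model only */

typedef struct {
    uint32_t ram[COGRAM_LONGS];
    uint32_t sp;           /* default base for LDPTR/STPTR      */
    uint32_t aug_base;     /* base latched by AUGPTR            */
    bool     aug_pending;  /* true only for the next LDPTR      */
} Cog;

/* AUGPTR Base, IndexOffset : latch Base+IndexOffset for the next instruction. */
void augptr(Cog *cog, uint32_t base, uint32_t index_offset)
{
    cog->aug_base = base + index_offset;
    cog->aug_pending = true;
}

/* LDPTR D,#S : uses the AUGPTR base once if one is pending, else SP. */
uint32_t ldptr(Cog *cog, uint32_t s)
{
    uint32_t base = cog->aug_pending ? cog->aug_base : cog->sp;
    cog->aug_pending = false;            /* the override is one-shot */
    assert(s < 512u);
    assert(base + s < COGRAM_LONGS);
    return cog->ram[base + s];
}
```

For an array starting at long 100 with 3-long elements, augptr(&cog, 100, 200 * 3) followed by ldptr(&cog, 1) reads long 701; the next ldptr without a preceding augptr falls back to SP.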

Also with an additional preceding AUGS instruction we can now access elements on the stack that fall outside the (presumably far more common) 0-511 range from SP, just by extending the #S value, eg. something like this should ultimately be doable too:

AUGS     #(Index >> 9)
LDPTR D, #(Index & $1ff)
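
The split between the two instructions is just a 23/9 bit partition of the offset (the AUGS encoding carries 23 immediate bits). A small C sketch of the arithmetic, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Upper bits carried by AUGS #(Index >> 9). */
uint32_t augs_part(uint32_t index) { return index >> 9; }

/* Lower 9 bits carried by the #S field of the following LDPTR/STPTR. */
uint32_t imm_part(uint32_t index) { return index & 0x1FFu; }

/* What the hardware would reassemble from the two pieces. */
uint32_t combine(uint32_t augs_bits, uint32_t imm_bits)
{
    assert(augs_bits < (1u << 23));  /* AUGS has room for 23 bits */
    assert(imm_bits < 512u);         /* #S is 9 bits              */
    return (augs_bits << 9) | imm_bits;
}
```

For example an offset of 1300 splits into 2 for AUGS and 276 for #S, since 2 * 512 + 276 = 1300.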

When coding this, I found we can still fit this in with Cluso's instructions using the following encoding scheme, still giving us room for having both an AUGPTR #D,#S and an AUGPTR D,S form (Cluso has not done it yet AFAIK but I believe the latter form could be handy too). We even have a couple of free instructions left by making use of WZ and IM (000110_110x). I may dream up something else soon for that as I keep thinking about what else will speed up assembled C running in a COG.

' iiiiii_zcri_cccc_ddddddddd_sssssssss

LOAD  D, S      ' 000110_x010_cccc_ddddddddd_sssssssss    optional WZ functions as expected (ie. sets Z flag if written value = 0)
LDPTR D, #S     ' 000110_0011_cccc_ddddddddd_sssssssss    WZ already part of encoded instruction, so not allowed as an option
STPTR D, #S     ' 000110_1011_cccc_ddddddddd_sssssssss    WZ already part of encoded instruction, so not allowed as an option
STORE D, S/#    ' 000110_x11x_cccc_ddddddddd_sssssssss    optional WZ functions as expected (ie. sets Z flag if written value = 0) 

AUGDS #D,#S     ' 000110_0001_xxxx_ddddddddd_sssssssss    reserved for Cluso (distinguished from LOAD/STORE group above by using write result bit = 0)
AUGDS D,S       ' 000110_0000_xxxx_ddddddddd_sssssssss    reserved for Cluso as an indirect way of setting up Ish:Isl and Idh:Idl of next instruction
AUGPTR D,#S     ' 000110_1001_xxxx_ddddddddd_sssssssss    compute pointer for immediately following LDPTR/STPTR instead of stack pointer
AUGPTR D,S      ' 000110_1000_xxxx_ddddddddd_sssssssss    compute pointer for immediately following LDPTR/STPTR instead of stack pointer
AUGS  #S        ' 000110_010s_ssss_sssssssss_sssssssss    reserved for Cluso
RSVD?           ' 000110_110x_xxxx_ddddddddd_sssssssss    reserved
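
To double check that the table above covers all 16 zcri combinations of opcode 000110 without overlap, it can be written out as a decoder. This is only a sketch built from the table as posted; the enum names and the encode/decode helpers are hypothetical and not taken from the Verilog or from any assembler.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    OP_OTHER, OP_LOAD, OP_LDPTR, OP_STPTR, OP_STORE,
    OP_AUGDS_IMM, OP_AUGDS_REG, OP_AUGPTR_IMM, OP_AUGPTR_REG,
    OP_AUGS, OP_RSVD
} Op;

/* Pack fields as iiiiii_zcri_cccc_ddddddddd_sssssssss with iiiiii = 000110. */
uint32_t encode(uint32_t zcri, uint32_t cond, uint32_t d, uint32_t s)
{
    assert(zcri < 16u && cond < 16u && d < 512u && s < 512u);
    return (0x06u << 26) | (zcri << 22) | (cond << 18) | (d << 9) | s;
}

Op decode(uint32_t insn)
{
    if ((insn >> 26) != 0x06u)
        return OP_OTHER;                    /* not in this opcode group */
    switch ((insn >> 22) & 0xFu) {          /* the zcri nibble */
    case 0x0: return OP_AUGDS_REG;          /* 0000 */
    case 0x1: return OP_AUGDS_IMM;          /* 0001 */
    case 0x2: case 0xA: return OP_LOAD;     /* x010 */
    case 0x3: return OP_LDPTR;              /* 0011 */
    case 0x4: case 0x5: return OP_AUGS;     /* 010s */
    case 0x6: case 0x7:
    case 0xE: case 0xF: return OP_STORE;    /* x11x */
    case 0x8: return OP_AUGPTR_REG;         /* 1000 */
    case 0x9: return OP_AUGPTR_IMM;         /* 1001 */
    case 0xB: return OP_STPTR;              /* 1011 */
    default:  return OP_RSVD;               /* 110x */
    }
}
```

Every nibble value from 0 to 15 lands in exactly one row, with the two 110x patterns left as the reserved slot.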

This attached code with this change includes MUL/MULS/PUSH/POP/RETX/CALLX/JMPX/LOAD/STORE/LDPTR/STPTR, but no AUG stuff yet. That needs a bit more work and integration with Cluso's changes. I am not quite yet ready for that part at the moment, still testing/playing with my own changes. Note that I've only done a quick check of the new operations but I haven't finished exhaustive testing to ensure I didn't break anything added previously so YMMV. Once it is all combined with the AUG stuff, tidied up a bit, and we have larger COGRAM space to play with I will want to retest it all again. Basically it's all a work in progress right now.

The time to compile the FPGA is probably about 1.5x now (at least for my DE-0 nano target) as more LEs and routing resource are being used for all the 32 bit buses and muxes being joined together. It's hardly a frugal implementation but the idea seems to be working out. In the final implementation I may plan to have this enabled for a subset of COGs only - I'm guessing I'll only need it for possibly just 1 or 2 COGs. That will bring it down in size again.

By the way I'm open for suggestions on better instruction names. They started out as LDSTACK, STSTACK but then I wanted to allow general purpose pointer access with AUGPTR so changed it to LDPTR/STPTR instead. Doesn't have to be this when it's all finalized.

Cheers,
rogloh

Willy Ekerslyke · 2014-09-23 13:19

Very impressive once again Roger. I'll try these out as soon as I can.

As for the instruction names, LDPTR reads as 'load pointer' to me - but I can't come up with an alternative. Relying on the instruction mnemonic to convey meaning becomes a problem when the instruction is 'complex' but we want short mnemonics. So perhaps we need to think about extending the syntax of the assembler. For example:

adder
SUB   sp,#1          ' make room for local variable x on the stack
MOV   r1,#3          ' we are loading a small constant so no need for AUGS
MOV   (sp+#0),r1     ' write this constant to local variable x on the stack at offset #0
MOV   r1,(sp+#3)     ' addresses and reads v1 argument from stack
MOV   r2,(sp+#2)     ' addresses and reads v2 argument from stack (#1 gets skipped as it holds return PC pushed after the arguments)
ADD   r1,r2          ' add v1 and v2
MOV   (sp+#0),r1     ' write sum to x
ADD   sp,#1          ' cleanup locals
RETX  #2             ' return and skip two input arguments

I'm sure there will be varied opinions and ideas so perhaps opening it up for discussion now is a good idea..

jmg · 2014-09-23 15:44

rogloh wrote: »

The time to compile the FPGA is probably about 1.5x now (at least for my DE-0 nano target) as more LEs and routing resource are being used for all the 32 bit buses and muxes being joined together.

Do you have any 'added LE' cost, and MHz impact, figures of this new feature ?

jmg · 2014-09-23 15:48

Willy Ekerslyke wrote: »

.... So perhaps we need to think about extending the syntax of the assembler. For example:

Simple code clarity is always good, the assembler can easily do the work.
Another common ASM syntax is the @ form - see below
A modern ASM could support both forms.

adder
SUB   sp,#1          ' make room for local variable x on the stack
MOV   r1,#3          ' we are loading a small constant so no need for AUGS
MOV   (sp+#0),r1     ' write this constant to local variable x on the stack at offset #0
MOV   r1,(sp+#3)     ' addresses and reads v1 argument from stack
MOV   r2,(sp+#2)     ' addresses and reads v2 argument from stack (#1 gets skipped as it holds return PC pushed after the arguments)
ADD   r1,r2          ' add v1 and v2
MOV   (sp+#0),r1     ' write sum to x
ADD   sp,#1          ' cleanup locals
RETX  #2             ' return and skip two input arguments

MOV   @sp+#0,r1     ' alternate syntax, # can be optional if offset is always an immediate value.
MOV   r2,@sp+#2

Cluso99 · 2014-09-23 17:35

rogloh,
fantastic. I am not familiar with C and its resulting pasm output, but i can see the uses for pasm and the spin interpreter.

i believe we need at least one big cog and perhaps a second with some of the features such as more cog ram.

I am busy atm with my family visiting from s.korea.

rogloh · 2014-09-23 18:08

Willy Ekerslyke wrote: »
Very impressive once again Roger. I'll try these out as soon as I can.

As for the instruction names, LDPTR reads as 'load pointer' to me - but I can't come up with an alternative. Relying on the instruction mnemonic to convey meaning becomes a problem when the instruction is 'complex' but we want short mnemonics. So perhaps we need to think about extending the syntax of the assembler. For example:
adder
SUB   sp,#1          ' make room for local variable x on the stack
MOV   r1,#3          ' we are loading a small constant so no need for AUGS
MOV   (sp+#0),r1     ' write this constant to local variable x on the stack at offset #0
MOV   r1,(sp+#3)     ' addresses and reads v1 argument from stack
MOV   r2,(sp+#2)     ' addresses and reads v2 argument from stack (#1 gets skipped as it holds return PC pushed after the arguments)
ADD   r1,r2          ' add v1 and v2
MOV   (sp+#0),r1     ' write sum to x
ADD   sp,#1          ' cleanup locals
RETX  #2             ' return and skip two input arguments
I'm sure there will be varied opinions and ideas so perhaps opening it up for discussion now is a good idea..

Thanks. We definitely should kick ideas around for the naming. Your example above is one possibility, though I think sharing the MOV mnemonic for both regular MOV and this pointer+offset dereference in this way may get confusing for the user and also potentially difficult for the assembler to differentiate when reading something like (sp+#). Using the form of MOV r1, @(sp+#3) or even MOV r1, *(sp+#3) may be nicer - we will want to include a way for the AUGPTR idea to work out too. Also it's not really equivalent to a simple MOV anymore as we can't use WZ/WC/NR either.

I know you've been adding some of this to the OpenSpin compiler recently which is great. Do bear in mind it is still subject to changes (including the encodings if I find optimizations), though I think we will be homing in on final set of capabilities reasonably soon.

rogloh · 2014-09-23 18:37

Do you have any 'added LE' cost, and MHz impact, figures of this new feature ?

@jmg: Not on its own in isolation. The overall DE-0 nano with everything in this list on all 8 COGs (MUL/MULS/PUSH/POP/RETX/CALLX/JMPX/LOAD/STORE/LDPTR/STPTR) comes out to 18029 LEs (or 81% of the DE-0 nano's Cyclone IV based EP4CE22F17C6 FPGA). I found the speed dropped a lot when MUL was incorporated, so I am not sure what this latest change has done by itself. FMAX for the COG clock rate jumps about each time I compile it but is still in the vicinity of 80MHz. I am hopeful of us maintaining that rate, and no optimizations to help out timing have been put in yet. I probably only need to be running at about 72MHz in my final application so I've got some more headroom. If someone needs the fastest clocked system this may not be right for them as it's feature rich rather than a blistering speed demon at the moment. I am okay with that. I just need it to beat an AVR for performance when running C.

The ALU LE logic count is 686 (of which MUL takes 92) and COG LE logic count is 567.

When compared to Chip's original baseline that got compiled without any of these changes, well that takes 14738 LEs or 66% of the DE-0 nano. Each of its ALUs takes 601 LEs and each COG logic takes 439 LEs. So we have effectively increased the ALU logic by about 14% (601 to 686 LEs) and COG logic by 29% (439 to 567 LEs) with all these changes, but we've added a whole lot of new functionality/capabilities.

I think the logic usage will come back down a lot if we reduce it to just a couple of COGs with this feature. What's nice with how we've done this is that the new COGRAM based PUSH/POP/JMPX/CALLX/RETX stuff can still be enabled in isolation in its own opcode and the AUGxxx/LOAD/STORE/LDPTR/STPTR stuff can optionally be enabled on top of that if we want to run fast C with a larger COGRAM. MUL/MULS is basically independent too. So we can still use each overall feature separately to some degree.

rogloh · 2014-09-23 18:37

jmg wrote: »

Simple code clarity is always good, the assembler can easily do the work.
Another common ASM syntax is the @ form - see below
A modern ASM could support both forms.

adder
SUB   sp,#1          ' make room for local variable x on the stack
MOV   r1,#3          ' we are loading a small constant so no need for AUGS
MOV   (sp+#0),r1     ' write this constant to local variable x on the stack at offset #0
MOV   r1,(sp+#3)     ' addresses and reads v1 argument from stack
MOV   r2,(sp+#2)     ' addresses and reads v2 argument from stack (#1 gets skipped as it holds return PC pushed after the arguments)
ADD   r1,r2          ' add v1 and v2
MOV   (sp+#0),r1     ' write sum to x
ADD   sp,#1          ' cleanup locals
RETX  #2             ' return and skip two input arguments

MOV   @sp+#0,r1     ' alternate syntax, # can be optional if offset is always an immediate value.
MOV   r2,@sp+#2

Yeah I like the @ form too, maybe with parenthesis like @(sp+#0).

rogloh · 2014-09-23 19:04

Cluso99 wrote: »

rogloh,
fantastic. I am not familiar with C and its resulting pasm output, but i can see the uses for pasm and the spin interpreter.

i believe we need at least one big cog and perhaps a second with some of the features such as more cog ram.

I am busy atm with my family visiting from s.korea.

Thanks Cluso.

When (unoptimized) C is compiled on a typical microcontroller, in general you will find that it does the following types of operations over and over again...

read in some variables normally kept on the local stack frame into registers
perform some ALU operation or test
write result back to variable in memory

It's a very common sequence and happens everywhere. Declaring "register" variables and using an optimizing compiler will try to keep things held in registers as much as possible to avoid the memory access but at some point it will most probably need to read or write the data from/to a local variable in main memory. The LDPTR and STPTR can optimize that read or write operation down to one instruction cycle making the memory access penalty very low now. Keeping it all in registers is still the fastest but not by such an enormous margin anymore as it once was.

For example, in PASM before without this feature, trying to access some software stack based locals would be very tedious and burn lots of instruction space and add extra delay as it also needs to use self modifying code. For instance, assume a software stack pointer in SP and us wanting to write data back to a local variable held at the 4th long offset from the stack pointer, we would need to do this:

       MOV tmpptr, sp
       ADD tmpptr, #4
       MOVD writer, tmpptr
       NOP ' need a delay slot here
writer MOV 0-0, reg

We can achieve all this in one instruction now (a 5x speedup and code reduction):

STPTR reg, #4

Hoping to merge in your AUG stuff soon and try it all out on a larger COGRAM.

rogloh · 2014-09-23 20:49

Looking over my instruction encodings I think I have found an improvement that keeps the WZ functionality for all my indirect loads and frees an instruction as well.

Using WR=1 (keeping WR=0 reserved for all the AUGxx stuff), we have 8 instruction variants for opcode (000110) using #, WC, WZ modifier combinations. A better usage may be something like this for example:

LOAD  D,S 
LOAD  D,S WZ        (WZ) sets Z if read value = 0
LDPTR D,#S          (#) 
LDPTR D,#S WZ       (WZ #)  sets Z if read value = 0
STORE D,S           (WC)
STORE D,#S          (WC, #)
(RSVD D,S)          (WC,WZ)
STPTR D,#S          (WC, WZ, #)

I prefer this as it makes more sense to be able to load a variable and optionally set Z when you read it rather than when you write it. You can very easily set Z beforehand when forming the register to be written to memory. So I believe it makes more sense to sacrifice the WZ variants of STORE. I even get a free instruction too.
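
The revised WR=1 group can be summarised as a small classifier over the zcri nibble. This is a sketch only: it assumes the WZ, WC and # modifiers map onto the z, c and i bits of the iiiiii_zcri layout, and the enum and function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { V_LOAD, V_LDPTR, V_STORE_REG, V_STORE_IMM, V_RSVD, V_STPTR } Variant;

/* zcri nibble: bit3 = WZ, bit2 = WC, bit1 = WR, bit0 = immediate (#). */
Variant classify(uint32_t zcri, bool *sets_z)
{
    assert(zcri < 16u);
    assert((zcri & 0x2u) != 0u);   /* WR=1 group only; WR=0 is kept for AUGxx */

    bool wz  = (zcri & 0x8u) != 0u;
    bool wc  = (zcri & 0x4u) != 0u;
    bool imm = (zcri & 0x1u) != 0u;

    *sets_z = !wc && wz;           /* only LOAD and LDPTR keep an optional WZ */
    if (!wc)
        return imm ? V_LDPTR : V_LOAD;
    if (!wz)
        return imm ? V_STORE_IMM : V_STORE_REG;
    return imm ? V_STPTR : V_RSVD; /* WC+WZ: STPTR with #, reserved without */
}
```

Here WC acts as the load/store selector, which is what frees WZ to mean "set Z on the value read" for both load forms.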

I will look at making some changes there when I can.

PS. By the way, I like the idea of having a sign extended form for AUGDS #, # too so you can use it to create decently sized negative constants (ie. 18 bits sign extended to 32 for #S by replicating bit17 upwards and at the same time access any COGRAM destination register in the next instruction). That way we can do something like this:

AUGDSS #(D>>9), #(-1300 >> 9)
MOV    (D & $1ff), #(-1300 & $1ff)

The "AUGDSS" here means use sign extended form, AUGDS/AUGDSZ might be the default (zero extended). We just need to sign extend #S for immediate constant use, never D. AUGS already allows us to set the sign bit, so that one is fine.
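
The sign extension idea can be checked with a few lines of C. This is a sketch of the arithmetic only, assuming the AUG instruction supplies the upper 9 bits of an 18-bit value and the following instruction the lower 9; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Upper 9 bits of the 18-bit constant, as carried by the AUGDSS #S field. */
uint32_t high9(int32_t value)
{
    assert(value >= -131072 && value <= 131071);  /* must fit in 18 bits */
    return ((uint32_t)value >> 9) & 0x1FFu;
}

/* Lower 9 bits, as carried by the #S field of the following instruction. */
uint32_t low9(int32_t value)
{
    return (uint32_t)value & 0x1FFu;
}

/* Rebuild the 32-bit constant by replicating bit 17 upwards. */
int32_t sign_extend18(uint32_t high, uint32_t low)
{
    assert(high < 512u && low < 512u);
    uint32_t v = (high << 9) | low;               /* 18-bit pattern */
    return (int32_t)(v ^ 0x20000u) - 0x20000;     /* portable sign extension */
}
```

For -1300 the two fields come out as $1fd and $0ec, and sign extending the reassembled 18-bit pattern gives -1300 again.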

Bill Henning · 2014-09-24 07:59

REALLY nice work!

rogloh · 2014-09-24 08:40

Thanks Bill.

As an update, tonight I have just made a change to use the newer encoding format in my last post above and will be testing it again soon. Keeping WZ for LOAD and LDPTR is really useful as it gives us the ability to get a variable from the stack or a pointer and set Z at the same time.

That should really help optimize all C code like this...

if (x) 
   statement;

if (*p)
   statement;

because you can do that sort of thing now in just a couple of PASM instructions...

       LDPTR r1,#4 WZ   ' x is 4 longs away from stack base
if_z   JMP #skip

or

       LOAD r1, r2 WZ    ' r2 is p, r1 is *p
if_z   JMP #skip2
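
Modelled in C, the first case looks roughly like this. It is a sketch under assumptions: Z is taken to be set when the value read is zero (as described above), x is assumed to live at SP+4, and the Cog struct and function names are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define COGRAM_LONGS 2048u  /* assumed size, for the model only */

typedef struct { uint32_t ram[COGRAM_LONGS]; uint32_t sp; } Cog;

/* LDPTR D,#S WZ : read *(SP+#S) and set Z when the value read is zero. */
uint32_t ldptr_wz(const Cog *cog, uint32_t s, bool *z)
{
    assert(s < 512u);
    assert(cog->sp + s < COGRAM_LONGS);
    uint32_t value = cog->ram[cog->sp + s];
    *z = (value == 0u);
    return value;
}

/* if (x) statement;  Returns true when "statement" would run. */
bool if_x_runs_statement(const Cog *cog)
{
    bool z;
    (void)ldptr_wz(cog, 4u, &z);  /*        LDPTR r1,#4 WZ */
    if (z)                        /* if_z   JMP #skip      */
        return false;
    return true;                  /* falls through into statement */
}
```

The load and the zero test happen in the same instruction, so the branch can follow immediately.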

I also ended up using my reserved opcode to make a GETPTR D,S instruction which returns the address computed as (SP+#S) into D. Note that even though the immediate bit is not encoded into this instruction variant it is implied that this operation uses the #S form, like Cluso's AUGS does. I had to do it like that to fit.

I expect GETPTR should become useful for doing this type of thing below - it's reasonably common to be able to take the address of a local variable in C...

char *p=&x;

which now just becomes this snippet in PASM:

GETPTR r1,3      ' if x is stored at SP+3, its address will be returned to r1.
STPTR r1,#2      ' if p is stored at SP+2 on stack frame, write r1 to it
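
In C model form the address-of case is just an add followed by a store. Again a sketch with made-up names; the load() helper assumes LOAD D,S reads COGRAM at the address held in the S register, as described in the earlier LOAD/STORE post.

```c
#include <assert.h>
#include <stdint.h>

#define COGRAM_LONGS 2048u  /* assumed size, for the model only */

typedef struct { uint32_t ram[COGRAM_LONGS]; uint32_t sp; } Cog;

/* GETPTR D,S : returns the address SP+#S itself, not the data stored there. */
uint32_t getptr(const Cog *cog, uint32_t s)
{
    assert(s < 512u);
    return cog->sp + s;
}

/* STPTR D,#S : *(SP+#S) <= D */
void stptr(Cog *cog, uint32_t d, uint32_t s)
{
    assert(s < 512u && cog->sp + s < COGRAM_LONGS);
    cog->ram[cog->sp + s] = d;
}

/* LOAD D,S (assumed semantics): D <= COGRAM[address held in S]. */
uint32_t load(const Cog *cog, uint32_t addr)
{
    assert(addr < COGRAM_LONGS);
    return cog->ram[addr];
}
```

With x at SP+3 and p at SP+2, getptr(&cog, 3) followed by stptr(&cog, r1, 2) leaves p holding the COGRAM address of x, and a later load through p returns x.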

It should also work with my proposed AUGPTR too when I get that far.

Roger.

jmg · 2014-09-24 12:50

rogloh wrote: »

Yeah I like the @ form too, maybe with parenthesis like @(sp+#0).

That works for me

Tho I would have the ASM accept also @(sp), as a legal equivalent of the special case of @(sp+#0)
(ie generates @(sp+#0), but allows user to say @(sp) )
