Hub Execution Model Thread (split from blog)

ozpropdev · 2013-12-04 12:32

David / Bill,

BIG is BIG!

Nice one!

David Betz · 2013-12-04 12:38

ozpropdev wrote: »

David / Bill,

BIG is BIG!

Nice one!

Thanks! It does seem to open up some possibilities even without the hub mode execution feature. We'll see if Chip thinks it is even possible. This may be a short-lived instruction! :-)

Bill Henning · 2013-12-04 12:44

THE FOLLOWING SUGGESTION IS FOR PROP 3

With "BIG" there is no longer any need to use

RDLONG reg,ptra++

to fetch 32 bit constants.

This made me think... why use PTRA for the PC then? Just use the internal cog PC, expanded to 18 bits.

Then I thought ... it would be nice to still be able to use cached reads without trashing the code cache...

Proposal for P3:

TWO 8-long caches

- one for RDWIDE/WRWIDE (aka "instruction cache")
- one for RDxxxC/WRxxxC (aka "data cache")

Of course, on the P3 we may also have space for more cache lines (4? 8?) each for I&D, but even a single line each would help, and would remove the restriction of not using RDxxxxC/WRxxxxC in hubexec mode.

This would leave PTRA available as a generic pointer for use by any hubexec code.

Again, this would help every compiler and virtual machine out there ... including Spin.

Cluso99 · 2013-12-04 13:12

I am not totally following your proposals, but the instruction(s) sound like a neat idea.
Flesh out what you are after and I will see if I can find the opcode space and then can put it to Chip with solutions.

David Betz · 2013-12-04 13:14

Bill Henning wrote: »

THE FOLLOWING SUGGESTION IS FOR PROP 3

This made me think... why use PTRA for the PC then? Just use the internal cog PC, expanded to 18 bits.

This is what I originally suggested to Chip but I don't think it will work. You run into trouble when hub code uses the normal CALL/RET instructions to call a function in COG memory. There isn't enough room in the RET instruction for the full 18 bit address to allow you to return to the hub code address.

David Betz · 2013-12-04 13:18

Cluso99 wrote: »

I am not totally following your proposals, but the instruction(s) sound like a neat idea.
Flesh out what you are after and I will see if I can find the opcode space and then can put it to Chip with solutions.

I'm not sure if this comment is in regard to my BIG proposal but the idea would be to allow the S field of an instruction to express a full 32 bit value. This would be particularly useful for loading 32 bit constants but it could also be useful for extending the address range of RDxxxx/WRxxxx instructions or any instructions that take an immediate value in the S field. This would be handled by having a new instruction called BIG that would have a 23 bit immediate field and would just store the value from that field into a hidden 23 bit register. Then, the next instruction would combine that 23 bit value with the 9 bit value in its S field to form a 32 bit value that would be used as the value of S. The hidden register would have to be cleared to zero after each use. Because we have thread support there would need to be 4 copies of this 23 bit register, one for each thread.

Edit: To be clearer, bits 31:9 would come from the value in the BIG register and 8:0 would come from the S field of the instruction following the BIG instruction. Also, the way we (Eric) implemented this for the MPE, the programmer never saw BIG instructions. They were automatically generated by the assembler, if needed, to express more than what could be expressed by the smaller field in the original instruction.
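The BIG behavior described above can be sketched as a small behavioral model. This is only an illustration under stated assumptions, not a description of any real hardware: the class and method names are hypothetical, and the model covers only what the post states (a hidden 23 bit register per thread, bits 31:9 from BIG, bits 8:0 from the S field, cleared after each use).

```python
S_BITS = 9
BIG_BITS = 23
S_MASK = (1 << S_BITS) - 1        # 0x1FF
BIG_MASK = (1 << BIG_BITS) - 1    # 0x7FFFFF
NUM_THREADS = 4                   # one hidden register per thread, as proposed


class BigPrefix:
    """Hypothetical model of the hidden per-thread BIG registers."""

    def __init__(self):
        self.big = [0] * NUM_THREADS

    def exec_big(self, thread, imm23):
        # BIG only latches its 23 bit immediate; nothing else changes.
        self.big[thread] = imm23 & BIG_MASK

    def immediate_s(self, thread, s_field):
        # Bits 31:9 come from BIG, bits 8:0 from the instruction's S field.
        value = (self.big[thread] << S_BITS) | (s_field & S_MASK)
        self.big[thread] = 0  # cleared to zero after each use
        return value
```

Because the register is zero whenever no BIG came first, an instruction without a BIG prefix sees its plain 9 bit immediate, so both cases take the same path through the model.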

ersmith · 2013-12-04 13:19

David Betz wrote: »

Do you know if LLVM has any more flexibility in handling different address spaces? I'm not suggesting that we switch at this point but just wondering for future reference.

I don't know for sure, but I doubt it. There really aren't a lot of processors that have the kind of split address spaces that we're talking about, at least not that I'm aware of.

Bill Henning · 2013-12-04 13:22

When in hubexec mode, the HRET instruction would use the LR register for the 18 bit return address.

It could actually be the same opcode as RET, as the cog knows what mode it is in, and as such choose between the embedded 9 bit return address, or the LR register with the hub address.

When called with HCALLA / HCALLB, HRETA / HRETB would use the appropriate stack's top entry.

David Betz wrote: »

This is what I originally suggested to Chip but I don't think it will work. You run into trouble when hub code uses the normal CALL/RET instructions to call a function in COG memory. There isn't enough room in the RET instruction for the full 18 bit address to allow you to return to the hub code address.

David Betz · 2013-12-04 13:28

Bill Henning wrote: »

When in hubexec mode, the HRET instruction would use the LR register for the 18 bit return address.

It could actually be the same opcode as RET, as the cog knows what mode it is in, and as such choose between the embedded 9 bit return address, or the LR register with the hub address.

When called with HCALLA / HCALLB, HRETA / HRETB would use the appropriate stack's top entry.

Yes but how would CALL work when in hub mode for calling code that is resident in the COG (fcache maybe)? It shouldn't use LR since that would mean that the COG function couldn't call another COG function. In other words, the semantics of CALL would be different in hub mode than in COG mode.

Bill Henning · 2013-12-04 13:40

Ray, I am not sure if you are addressing BIG, the original hubex instructions in post#1 of this thread, the addition of a link register and cog-like call/returns in post#68, or the P3 suggestions in #124.

After chewing on it a bit, I will update post#1 regardless

Here is a rundown:

post #1 - the original simplest hub execution model, used SPA as per Chip only
post #68 - I mirrored the way pasm works for cog code, as David and Eric want a link register for PropGCC, with AUX stack support for assembly, spin, and other compilers
post #73 - you pointed out lack of dual operand opcodes, and suggested Chip merge DECODE3/4/5... an excellent suggestion
post #74 - encodes HJMP, HCALL, HCALLAR and HCALLBR into a single dual operand instruction after your suggestion, HRET is assumed to go into any spare single op code
post #101 - David suggested LOADHI and LOADLO word instructions, which would need more dual operand space
post #106 - David suggested the "BIG" instruction like on a previous processor he and Eric worked with
post #111 - I suggested encoding BIG as a NOP (7 top bits 0's)
post #124 - expanding on David's BIG, for Prop3, I suggested that RDWIDE/WRWIDE have a separate 8-long cache from the other RDxxxxC/WRxxxxC instructions
as that would eliminate needing to use PTRA as the PC, and allow using the RDxxxxC/WRxxxxC instructions in hubexec mode without thrashing the instruction cache
effectively turning the WIDE cache into an instruction cache, and the RDxxxxxC cache into a data cache. For P3, multiple lines of 8 I&D cache lines could be supported

Basically I am waiting to see which direction Chip is leaning in.

opcode space needed:

1 dual operand opcode, probably recovered from the decodes - maps to HJMP / HCALL / HCALLA / HCALLB - see post #74 and #68
3 single operand opcodes (may fit in one using spare bits) - maps to HRET / HRETA / HRETB
0 changing NOP so it sets the hidden 'BIG' register, affecting the immediately following instruction and allowing construction of 32 bit constants
also requires NOP be a true NOP - cannot modify anything but the hidden BIG register, not change C or Z

I think it would be best if the LR was at a fixed location - say $1F1 - as then there is no need for a SETLR D/# instruction

It also makes sense to map the RDWIDE cache to $1E0, because then cleverly written hub code could freely reference all eight locations in its block for hub instructions, making hub reads and writes very easy: rdlong reg, $1E1 would use the NOP (BIG) at that location to read the hub location addressed by $1E1, avoiding an extra hub read cycle.

Cluso99 wrote: »

I am not totally following your proposals, but the instruction(s) sound like a neat idea.
Flesh out what you are after and I will see if I can find the opcode space and then can put it to Chip with solutions.

Bill Henning · 2013-12-04 13:44

Good point. Sorry, the fast pace had me cross-eyed.

For HUBEXEC mode:

HCALL saves return address in LR, HRET returns to address in LR

HCALLA saves return address on stack A, HRETA returns to address on top of stack A

HCALLB saves return address on stack B, HRETB returns to address on top of stack B

As for cog native code, works exactly as documented.

CALL/RET as normal use JMPRET

CALLAR / RETAR use stack A

CALLBR / RETBR use stack B

Quite symmetrical and Prop-like this way!
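The three hubexec call/return pairs listed above can be sketched as a tiny model. This is an illustration only; the class and method names are hypothetical, and the stacks are modeled as plain Python lists rather than as the proposed stack A and stack B hardware.

```python
class HubCallModel:
    """Hypothetical model of the proposed hubexec call/return pairs."""

    def __init__(self):
        self.lr = None      # link register: holds a single return address
        self.stack_a = []   # stand-in for stack A
        self.stack_b = []   # stand-in for stack B

    def hcall(self, return_addr):
        # HCALL saves the return address in LR, overwriting any earlier value.
        self.lr = return_addr

    def hret(self):
        # HRET returns to the address held in LR.
        return self.lr

    def hcalla(self, return_addr):
        # HCALLA pushes the return address on stack A.
        self.stack_a.append(return_addr)

    def hreta(self):
        # HRETA returns to the address on top of stack A.
        return self.stack_a.pop()

    def hcallb(self, return_addr):
        # HCALLB pushes the return address on stack B.
        self.stack_b.append(return_addr)

    def hretb(self):
        # HRETB returns to the address on top of stack B.
        return self.stack_b.pop()
```

In this model a second HCALL overwrites LR, so a function that makes its own calls would have to save LR first, while the stack-based variants nest without extra work.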

David Betz wrote: »

Yes but how would CALL work when in hub mode for calling code that is resident in the COG (fcache maybe)? It shouldn't use LR since that would mean that the COG function couldn't call another COG function. In other words, the semantics of CALL would be different in hub mode than in COG mode.

David Betz · 2013-12-04 13:48

Bill Henning wrote: »

Good point. Sorry, the fast pace had me cross-eyed.

For HUBEXEC mode:

HCALL saves return address in LR, HRET returns to address in LR

HCALLA saves return address on stack A, HRETA returns to address on top of stack A

HCALLB saves return address on stack B, HRETB returns to address on top of stack B

As for cog native code, works exactly as documented.

CALL/RET as normal use JMPRET

CALLAR / RETAR use stack A

CALLBR / RETBR use stack B

Quite symmetrical and Prop-like this way!

Okay but what value gets stored in the RET instruction if I execute a CALL from hub mode code? Is it the COG address in the 8-long window into hub memory? If so, what happens if the CALL is in the last of the 8 longs? Won't I return to an address outside of the 8-long window?

Bill Henning · 2013-12-04 14:01

Again, good question.

The more I think about it, the more I like a fixed location for the WIDE cache. $1E0 is an excellent candidate.

7 out of 8 cases are not a problem, leaving what happens if the CALL to the cog routine was in the eighth slot, as you point out.

When jumping into the cache area with a JMPRET, simply resume hub execution at the next hub instruction location, which the PC will "remember" as the upper address bits would still be valid, and the lower bits would be supplied by the JMPRET.

When jumping to one instruction beyond the end of the cache from cog mode, fetch the next 8-longs (if they have not already been fetched) and resume at the next hub address.

If the 8-long cache is at a known, fixed address, then this takes very few gates.
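The return rule described in this post can be sketched as follows. This is a rough model under assumptions that are not in the thread: hub addresses are treated as byte addresses with 4 bytes per long, the window sits at $1E0 as suggested, and the function name and return tuple are hypothetical.

```python
WINDOW_BASE = 0x1E0   # suggested fixed cog address of the 8-long window
WINDOW_LONGS = 8
BYTES_PER_LONG = 4    # assumption: hub addresses are byte addresses


def resume_hub_pc(block_base, cog_return_addr):
    """Map a cog return address back to a hub PC.

    block_base is the hub address of the 8-long block currently in the
    window (the "upper address bits" the PC remembers).
    Returns (new_block_base, hub_pc, needs_fetch).
    """
    slot = cog_return_addr - WINDOW_BASE
    if 0 <= slot < WINDOW_LONGS:
        # Return lands inside the window: the low bits come from the return.
        return block_base, block_base + BYTES_PER_LONG * slot, False
    if slot == WINDOW_LONGS:
        # One past the end: the CALL was in the eighth slot, so move on
        # to the next 8 longs and resume at their first instruction.
        new_base = block_base + BYTES_PER_LONG * WINDOW_LONGS
        return new_base, new_base, True
    raise ValueError("return address is not in or just past the window")
```

For example, with the window holding the block at hub $1000, a return to $1E3 resumes at $100C, and a return to $1E8 moves to the block at $1020.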

David Betz wrote: »

Okay but what value gets stored in the RET instruction if I execute a CALL from hub mode code? Is it the COG address in the 8-long window into hub memory? If so, what happens if the CALL is in the last of the 8 longs? Won't I return to an address outside of the 8-long window?

Cluso99 · 2013-12-04 14:08

David & Bill,
I can try and find opcode space for instructions you might need. My reference was really to the BIG instructions, but any instructions.

I don't understand what DECOD3/4/5 and ENCOD do, so I am not sure if we can grab the WZ/WC bits. That would free 2 full instructions [#]D,S. There is plenty of instruction space for single operand instructions.

David,
If I understand your BIG instruction, that would load a register with the top 23 bits of an address.
Preferably, all other instructions that have #S (an immediate S - of 9 bits) would concatenate the BIG registers top 23 bits with the 9 immediate bits to form a 32 bit address.
After each of these instructions (those with #S), BIG would be cleared to $0.

Each instruction would then fetch the S value using the resultant 32 bit address. If the resultant address was <=$1FF, the cog register would be used. If the address was >$1FF then the instruction would fetch the long from hub, stalling the pipe until it read the value for the instruction to continue.
I am not sure this is a trivial code change, but it is likely to only affect the execution unit (Verilog code), not every instruction.

When using this type, could the GCC model be precluded from using multi-tasking? This would sure simplify things.

Cluso99 · 2013-12-04 14:12

Does it help your case if AUX could map more than one 8 Long block into cog???

My thinking has been to have the RDWIDE instruction be a small state m/c capable of loading up to 8 WIDES in the background (i.e the whole AUX).

David Betz · 2013-12-04 14:15

Cluso99 wrote: »

David,
If I understand your BIG instruction, that would load a register with the top 23 bits of an address.
Preferably, all other instructions that have #S (an immediate S - of 9 bits) would concatenate the BIG registers top 23 bits with the 9 immediate bits to form a 32 bit address.
After each of these instructions (those with #S), BIG would be cleared to $0.

Yes, this is how I propose that BIG work. The BIG instruction supplies bits 31:9 to the following instruction's S field.

Each instruction would then fetch the S value using the resultant 32 bit address. If the resultant address was <=$1FF, the cog register would be used. If the address was >$1FF then the instruction would fetch the long from hub, stalling the pipe until it read the value for the instruction to continue.

What I was proposing wasn't this complicated. The 32 bit value would be used as an immediate value in the instruction that follows the BIG instruction. There would be no range checking and no hub accesses unless the instruction happened to be RDxxxx or WRxxxx. In that case, the 32 bit immediate value would be the hub address for the hub access. Really, nothing in the COG processor would need to change except the handling of immediate values in the S field.

I am not sure this is a trivial code change, but it is likely to only affect the execution unit (Verilog code), not every instruction.

I don't think it would even affect the execution unit. It would only affect the instruction decoder that handles the forming of immediate operands.

When using this type, could the GCC model be precluded from using multi-tasking? This would sure simplify things.

I think there should be one 23 bit "big" register for each thread. That way the BIG instruction could be used even when threading was in use.

Bill Henning · 2013-12-04 14:16

My preference would be to use NOP's low 23 bits as the low 23 bits of the hidden BIG register.

This would allow using it directly as a hub address for {RD|WR}xxxxx{C}

For building large constants, the sssssss bits could supply the top 9 bits, if BIG was the preceding instruction.

Using NOP's like this also makes for great tables of hub addresses, that would be harmless to execute.

Edit: I saw David's response

I agree with David, there should be a BIG register for each task.

Cluso99 wrote: »

David & Bill,
I can try and find opcode space for instructions you might need. My reference was really to the BIG instructions, but any instructions.

I don't understand what DECOD3/4/5 and ENCOD do, so I am not sure if we can grab the WZ/WC bits. That would free 2 full instructions [#]D,S. There is plenty of instruction space for single operand instructions.

David,
If I understand your BIG instruction, that would load a register with the top 23 bits of an address.
Preferably, all other instructions that have #S (an immediate S - of 9 bits) would concatenate the BIG registers top 23 bits with the 9 immediate bits to form a 32 bit address.
After each of these instructions (those with #S), BIG would be cleared to $0.

Each instruction would then fetch the S value using the resultant 32 bit address. If the resultant address was <=$1FF, the cog register would be used. If the address was >$1FF then the instruction would fetch the long from hub, stalling the pipe until it read the value for the instruction to continue.
I am not sure this is a trivial code change, but it is likely to only affect the execution unit (Verilog code), not every instruction.

When using this type, could the GCC model be precluded from using multi-tasking? This would sure simplify things.

David Betz · 2013-12-04 14:18

Bill Henning wrote: »

My preference would be to use NOP's low 23 bits as the low 23 bits of the hidden BIG register.

This would allow using it directly as a hub address for {RD|WR}xxxxx{C}

For building large constants, the sssssss bits could supply the top 9 bits, if BIG was the preceding instruction.

Using NOP's like this also makes for great tables of hub addresses, that would be harmless to execute.

Edit: I saw David's response

I agree with David, there should be a BIG register for each task.

I think the 9 bit S field of the instruction following the BIG instruction should supply bits 8:0 of the immediate constant. That way the instruction parsing happens identically whether the BIG instruction is present or not. It's a smaller change. What advantage do you gain by using the BIG value as the low bits?

SRLM · 2013-12-04 14:22

Bill Henning wrote: »

Here we disagree to some extent [ about the use of AUX RAM ]. For microcontroller type of programs it would suffice, and be faster.

To throw in some numbers with the current PropGCC, I just had to increase the stack for my CMM cog from 176 longs to 276 longs because I was getting a stack overflow. I'm only calling 1 function level deep, but I have local objects that take up the rest of the space.

Bill Henning · 2013-12-04 14:26

A pretty BIG advantage for compilers and hub assembly code. (sorry, I could not resist that obvious pun)

Consider the following:

BIG hubaddress
RDLONG reg,#0 ' simpler than splitting up the address, and in most cases, the top 9 bits of the address would be zero

- Returns contents of hub[hubaddress], without having to put the lower nine bits into the RDLONG - easier for compilers.

- It also allows hubaddress to be referenced by other instructions in the 8-long cache (if it is at a fixed location) as no need to fill low 9 bits

- it allows constructing jump tables, as no need to fill low 9 bits

It still works great to load 32 bit constants:

BIG low23bits
MOV reg,#high9bits

I think NOP should absorb the BIG functionality, after all, it is not supposed to change the program state.

If NOP is used, and the top 7 bits are enough to make it a NOP/BIG (ie not affect anything outside of BIG), then the C or Z flags could be used to indicate top/bottom 23 bits in BIG.
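The alternative layout in this post, where BIG supplies the low 23 bits and the following S field supplies the top 9 bits, can be sketched like this. The function name is hypothetical and the sketch models only the bit layout, not the NOP encoding question.

```python
def combine_big_low(big23, s9):
    """Form a 32 bit value with BIG as bits 22:0 and the S field as bits 31:23."""
    return ((s9 & 0x1FF) << 23) | (big23 & 0x7FFFFF)


def split_for_big_low(value):
    """Split a 32 bit constant into (low23bits, high9bits) for this layout."""
    return value & 0x7FFFFF, (value >> 23) & 0x1FF
```

With this layout, `BIG hubaddress` followed by `RDLONG reg,#0` yields `hubaddress` unchanged whenever the address fits in 23 bits, which is the convenience being argued for here.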

David Betz wrote: »

I think the 9 bit S field of the instruction following the BIG instruction should supply bits 8:0 of the immediate constant. That way the instruction parsing happens identically whether the BIG instruction is present or not. It's a smaller change. What advantage do you gain by using the BIG value as the low bits?

Bill Henning · 2013-12-04 14:30

Thank you, good data.

I write drivers in assembly or C :-)

Drivers written in C, without local objects, would not use anywhere near that much stack.

SRLM wrote: »

To throw in some numbers with the current PropGCC, I just had to increase the stack for my CMM cog from 176 longs to 276 longs because I was getting a stack overflow. I'm only calling 1 function level deep, but I have local objects that take up the rest of the space.

David Betz · 2013-12-04 14:33

Bill Henning wrote: »

A pretty BIG advantage for compilers and hub assembly code. (sorry, I could not resist that obvious pun)

Consider the following:

BIG hubaddress
RDLONG reg,#0 ' simpler than splitting up the address, and in most cases, the top 9 bits of the address would be zero

I would recommend having the BIG instructions generated automatically by the assembler. This is what we did for the MPE. The programmer never coded a BIG instruction directly. The assembler handled it based on the value of the immediate operand and whether it would fit in the instruction's own immediate field.
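How an assembler might emit BIG automatically can be sketched as below. This is only an illustration of the idea, not the MPE assembler's actual logic: the function name and the `BIG #$...` operand syntax are assumptions, and it uses the layout where BIG carries bits 31:9.

```python
def encode_immediate(mnemonic, dest, value):
    """Return the source lines for `mnemonic dest,#value`.

    A BIG prefix is emitted only when the value does not fit in the
    instruction's own 9 bit immediate field.
    """
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("immediate must fit in 32 bits")
    low9 = value & 0x1FF      # goes in the instruction's S field
    high23 = value >> 9       # goes in the BIG prefix, if nonzero
    lines = []
    if high23:
        lines.append(f"BIG #${high23:X}")
    lines.append(f"{mnemonic} {dest},#${low9:X}")
    return lines
```

A small constant such as 5 produces a single instruction, while a full 32 bit constant produces a BIG followed by the original instruction, so the programmer never writes BIG by hand.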

Bill Henning · 2013-12-04 14:36

Good point, but does not address the two other advantages of using the BIG value as the low 23 bits:

- It also allows hubaddress to be referenced by other instructions in the 8-long cache (if it is at a fixed location) as no need to fill low 9 bits

- it allows constructing jump tables, as no need to fill low 9 bits

Both of the above would also be safe to execute.

I believe gcc's implementation of case statements could benefit from jump tables.

David Betz wrote: »

I would recommend having the BIG instructions generated automatically by the assembler. This is what we did for the MPE. The programmer never coded a BIG instruction directly. The assembler handled it based on the value of the immediate operand and whether it would fit in the instruction's own immediate field.

David Betz · 2013-12-04 14:38

Bill Henning wrote: »

Good point, but does not address the two other advantages of using the BIG value as the low 23 bits:

- It also allows hubaddress to be referenced by other instructions in the 8-long cache (if it is at a fixed location) as no need to fill low 9 bits

- it allows constructing jump tables, as no need to fill low 9 bits

Both of the above would also be safe to execute.

I believe gcc's implementation of case statements could benefit from jump tables.

I don't understand the comment about jump tables. Why would one use BIG to express the address of hub locations in a jmp table? Why not just use .long?

Sapieha · 2013-12-04 14:39

Hi.

I think it is not possible to use NOP.

As WAIT uses it already --

Bill Henning · 2013-12-04 14:48

It is not a huge matter, but it would allow code to fall-through in case of a conditional jump table, without modifying cog state.

still leaves...

- It also allows hubaddress to be referenced by other instructions in the 8-long cache (if it is at a fixed location) as no need to fill low 9 bits

Using it for the low bits is basically a strategy for saving longs in the executable.

You have now piqued my curiosity - can you provide some examples where it helps to have it load the top 23 bits?

(yet another good discussion!)

David Betz wrote: »

I don't understand the comment about jump tables. Why would one use BIG to express the address of hub locations in a jmp table? Why not just use .long?

Bill Henning · 2013-12-04 14:51

Ouch!

I just checked the latest docs,

0000000 ZC I CCCC DDDDDDDDD SSSSSSSSS RDBYTE D,S/PTRx

So much for using NOP!

Sapieha wrote: »

Hi.

I think it is not possible to use NOP.

As WAIT uses it already --

David Betz · 2013-12-04 14:52

Bill Henning wrote: »

It is not a huge matter, but it would allow code to fall-through in case of a conditional jump table, without modifying cog state.

still leaves...

- It also allows hubaddress to be referenced by other instructions in the 8-long cache (if it is at a fixed location) as no need to fill low 9 bits

Using it for the low bits is basically a strategy for saving longs in the executable.

You have now piqued my curiosity - can you provide some examples where it helps to have it load the top 23 bits?

(yet another good discussion!)

I understand but it complicates the implementation. The COG still has to work even without the presence of any BIG instructions and that means that instructions that take an immediate operand will have to be parsed differently depending on whether they are preceded by BIG. It also means that you can't just use BIG for every immediate value. You need to remember if the previous instruction was BIG which will involve one more flop. I figure the least risky way of doing this is to say that the COG instruction parser always ORs S values with BIG without regard to whether there was a BIG instruction preceding it. That combined with clearing the BIG register after use means that there isn't a special case for handling BIG.

ozpropdev · 2013-12-04 14:54

Cluso99 wrote: »

I don't understand what DECOD3/4/5 and ENCOD do, so I am not sure if we can grab the WZ/WC bits.

Ray,

DECOD5 takes the lower 5 bits and decodes it into a single bit mask.

DECOD5 reg2

replaces

MOV reg1,#1
SHL reg1,reg2

DECOD4 takes the lower 4 bits and creates a 16 bit mask. The resulting mask is copied to the 2 word positions.
DECOD3 takes the lower 3 bits and creates an 8 bit mask. The resulting mask is copied to the 4 byte positions.

ENCOD does the reverse.

Example

ENCOD reg1,#%10000 would return the value 5 to indicate that the 5th bit is set.
Values returned are in the range of 1 to 32.
A zero result represents no bit is set.

IIRC if multiple bits are set, it returns the most significant bit.

Maybe these can be shrunk into one opcode by using WZ,WC as suggested.
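The DECOD and ENCOD behavior described in this post can be modeled as follows. This is a sketch of the description above only, not a verified model of the hardware: the function names are hypothetical, and the most-significant-bit rule for ENCOD follows the "IIRC" remark in the post.

```python
def decod5(x):
    # Lower 5 bits select one bit of a 32 bit mask.
    return 1 << (x & 0x1F)


def decod4(x):
    # Lower 4 bits select one bit of a 16 bit mask, copied to both words.
    mask = 1 << (x & 0xF)
    return mask | (mask << 16)


def decod3(x):
    # Lower 3 bits select one bit of an 8 bit mask, copied to all 4 bytes.
    mask = 1 << (x & 0x7)
    return mask * 0x01010101


def encod(x):
    # Position (1 to 32) of the most significant set bit, or 0 if none is set.
    return (x & 0xFFFFFFFF).bit_length()
```

Here `encod(0b10000)` gives 5, matching the example in the post, and `decod5(n)` gives the same result as the MOV/SHL pair it replaces.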

Bill Henning · 2013-12-04 14:55

You have convinced me, and with %0000000 no longer being the NOP, the longs potentially saved are not worth it (for P2).

Good discussion!

David Betz wrote: »

I understand but it complicates the implementation. The COG still has to work even without the presence of any BIG instructions and that means that instructions that take an immediate operand will have to be parsed differently depending on whether they are preceded by BIG. It also means that you can't just use BIG for every immediate value. You need to remember if the previous instruction was BIG which will involve one more flop. I figure the least risky way of doing this is to say that the COG instruction parser always ORs S values with BIG without regard to whether there was a BIG instruction preceding it. That combined with clearing the BIG register after use means that there isn't a special case for handling BIG.
