Propeller II update - BLOG

cgracey · 2013-12-13 18:55

Bill Henning wrote: »

It may be simpler to implement a

CALL_LR D/#16b

That stores the return address in LR (suggest $1F1)

and a RET_LR would simply be a macro for

JMP $1F1

as this would have a nice embedded 16 bit address in a single-long instruction.

What if we made an 8-level LIFO stack that had special instructions simply called CALL/CALLD and RET/RETD, and those would replace use of JMPRET/JMPRETD. JMPRET/JMPRETD could then just be used for register-based thread loops that track Z/C/PC. Would we miss anything? We wouldn't need to mess with subroutine_RET labels, anymore.

Edit: We'd need one of these for each task. 8 cogs x 4 stacks x 8 levels x 19 bits (hub/cog mode, Z, C, PC[15:0]) = 4,864 flops.
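As a quick check of that figure, the arithmetic can be reproduced in a few lines of Python. This is only a sketch: the field widths come from the post above, while the constant names are labels made up for the illustration.

```python
# Flop-count estimate for a per-task hardware return stack.
# Field widths are taken from the post; the names are illustrative only.
COGS = 8
TASKS_PER_COG = 4
STACK_LEVELS = 8
BITS_PER_ENTRY = 1 + 1 + 1 + 16  # hub/cog mode, Z, C, PC[15:0]

total_flops = COGS * TASKS_PER_COG * STACK_LEVELS * BITS_PER_ENTRY
print(total_flops)  # 4864
```

Halving the depth to 4 levels would halve the count, since every term is a simple multiplier.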

David Betz · 2013-12-13 19:29

cgracey wrote: »

What if we made an 8-level FIFO stack that had special instructions simply called CALL/CALLD and RET/RETD, and those would replace use of JMPRET/JMPRETD. JMPRET/JMPRETD could then just be used for register-based thread loops that track Z/C/PC. Would we miss anything? We wouldn't need to mess with subroutine_RET labels, anymore.

Edit: We'd need one of these for each task. 8 cogs x 4 stacks x 8 levels x 19 bits (hub/cog mode, Z, C, PC[15:0]) = 4,864 flops.

There might be an advantage to having the LR register be user visible. It could be used for table dispatches for instance. Also, your FIFO change would prevent the use of JMPRET cooperative threads although maybe that isn't necessary anymore now that we have hardware tasks.

cgracey · 2013-12-13 19:33

David Betz wrote: »

There might be an advantage to having the LR register be user visible. It could be used for table dispatches for instance. Also, your FIFO change would prevent the use of JMPRET cooperative threads although maybe that isn't necessary anymore now that we have hardware tasks.

JMPRET could still be used for cooperative threading.

David Betz · 2013-12-13 19:43

David Betz wrote: »

There might be an advantage to having the LR register be user visible. It could be used for table dispatches for instance. Also, your FIFO change would prevent the use of JMPRET cooperative threads although maybe that isn't necessary anymore now that we have hardware tasks.

Okay, how about this? You implement your FIFO but make the top value on the FIFO visible at $1F1. That way you have the CALL_LR instruction that I originally proposed but also the opportunity to nest functions up to 8 levels deep. Would it be hard to expose the top-of-stack at $1F1?

cgracey · 2013-12-13 19:54

David Betz wrote: »

Okay, how about this? You implement your FIFO but make the top value on the FIFO visible at $1F1. That way you have the CALL_LR instruction that I originally proposed but also the opportunity to nest functions up to 8 levels deep. Would it be hard to expose the top-of-stack at $1F1?

Would it be okay if we made push and pop instructions, instead of showing it at $1F1?

David Betz · 2013-12-13 19:58

cgracey wrote: »

Would it be okay if we made push and pop instructions, instead of showing it at $1F1?

I guess so but it would make the table dispatch idea take an extra instruction.

With LR:

    mov r0, #2 ' or some code to compute a case value
    call_lr #dispatch
    long case0
    long case1
    long case2
    long case3
    ....
    long caseN

dispatch
    add lr, r0
    jmp lr

With stack:

    mov r0, #2 ' or some code to compute a case value
    call_lr #dispatch
    long case0
    long case1
    long case2
    long case3
    ....
    long caseN

dispatch
    pop lr
    add lr, r0
    jmp lr

Edit: Actually, I suppose you would never do things like this anyway. Maybe it was a dumb idea! :-(

Cluso99 · 2013-12-13 20:10

cgracey wrote: »

What if we made an 8-level FIFO stack that had special instructions simply called CALL/CALLD and RET/RETD, and those would replace use of JMPRET/JMPRETD. JMPRET/JMPRETD could then just be used for register-based thread loops that track Z/C/PC. Would we miss anything? We wouldn't need to mess with subroutine_RET labels, anymore.

Edit: We'd need one of these for each task. 8 cogs x 4 stacks x 8 levels x 19 bits (hub/cog mode, Z, C, PC[15:0]) = 4,864 flops.

That's a lot of flops for a specific purpose.

Could AUX be divided into 4 stacks of 8 levels? AUX is also directly addressable, so maybe this would also solve David's LR?

David Betz · 2013-12-13 20:15

Cluso99 wrote: »

That's a lot of flops for a specific purpose.

Could AUX be divided into 4 stacks of 8 levels? AUX is also directly addressable, so maybe this would also solve David's LR?

I'm afraid I'm not a big fan of all of these small special purpose memories that have dedicated instructions to manipulate them. You say AUX is directly addressable. I guess I've lost track of the current instruction set but I thought AUX was only addressable by load/store instructions. Or can I do something like this:

    add r0, aux[r1]

I like the idea of LR because it is a normal COG register and can be used anywhere that a COG register can be used. Is that also true of AUX locations?

Cluso99 · 2013-12-13 20:21

Seems to me that the multitasking is becoming so complex that we might be better off with 16 simpler cogs???

Cluso99 · 2013-12-13 20:25

David Betz wrote: »

I'm afraid I'm not a big fan of all of these small special purpose memories that have dedicated instructions to manipulate them. You say AUX is directly addressable. I guess I've lost track of the current instruction set but I thought AUX was only addressable by load/store instructions. Or can I do something like this:

    add r0, aux[r1]

I like the idea of LR because it is a normal COG register and can be used anywhere that a COG register can be used. Is that also true of AUX locations?

Currently, NO it cannot.

However, to me it seems that AUX, the HUBEXEC instruction cache, the RDxxxxC and now WIDE are all special purpose memories that would be better served if they were just a larger cog ram extension, accessible by some bank style of "BIG" (the new AUGS/D) method.

I am sure there would be a simple way of doing this.

David Betz · 2013-12-13 20:43

Cluso99 wrote: »

Currently, NO it cannot.

However, to me it seems that AUX, the HUBEXEC instruction cache, the RDxxxxC and now WIDE are all special purpose memories that would be better served if they were just a larger cog ram extension, accessible by some bank style of "BIG" (the new AUGS/D) method.

I am sure there would be a simple way of doing this.

I almost suggested expanding COG memory instead of adding AUX or a return stack but then remembered that Chip has already said that COG memory is very expensive because it has three read ports and a write port or maybe more. There is probably not a one-to-one exchange of AUX locations for COG locations for example.

Cluso99 · 2013-12-13 21:07

David Betz wrote: »

I almost suggested expanding COG memory instead of adding AUX or a return stack but then remembered that Chip has already said that COG memory is very expensive because it has three read ports and a write port or maybe more. There is probably not a one-to-one exchange of AUX locations for COG locations for example.

IIRC from Chip & Beau's die sizes, the AUX consumes 0.2 mm² and the COG (RAMs) 0.6 mm². AUX is 1/2 COG size, so AUX would consume 0.4 vs. 0.6 for equal sizes. But we are adding flops, which IIRC Chip says use a lot more space than the dedicated memory that he & Beau built. Using that equation, if AUX were converted to COG, then 256 longs would use 0.3 instead of 0.2 (times 8 for 8 cogs).
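The area estimate in that paragraph can be worked through as below. This is a rough sketch using only the figures recalled in the post (which are themselves from memory); the 512-long COG size is implied by "AUX is 1/2 COG size" with 256 AUX longs, and the variable names are invented for the example.

```python
from fractions import Fraction

# Figures as recalled in the post (not verified die measurements).
AUX_AREA = Fraction("0.2")  # mm^2 for 256 longs of AUX
COG_AREA = Fraction("0.6")  # mm^2 for the COG RAM (implied 512 longs)
AUX_LONGS, COG_LONGS, COGS = 256, 512, 8

# Area of 256 longs if built like COG RAM instead of AUX.
cog_area_per_256 = COG_AREA * AUX_LONGS / COG_LONGS  # 3/10 mm^2
extra_per_cog = cog_area_per_256 - AUX_AREA          # 1/10 mm^2
extra_total = extra_per_cog * COGS                   # 4/5 mm^2 over 8 cogs

print(cog_area_per_256, extra_per_cog, extra_total)
```

Exact fractions are used so the 0.3 vs. 0.2 comparison is not blurred by floating-point rounding.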

With the new 16 bit PC fields, a contiguous address space could be used, where just the hub $00000-00400 (mostly ROM) would not be mapped.

Using a few tricks, such as "BIG" and the setting of an AUX bit, we could remap the AUX into Cog simply, and then use it for D & S values, such that all the normal and very useful instructions like AND/XOR/etc would work on the AUX as well. Probably we would not then require all the extra AUX access instructions.

The HUBEXEC instruction cache could use a block of AUX, as could the WIDE cache, etc.

I am sure it is not only workable, but would simplify things considerably.

Of course, the downside is that the cog/aux ram would need to be partially redone, to make for "wide" blocks. Maybe that is too big a change and is on the critical path. It just seems to make the most sense to me.

evanh · 2013-12-14 02:00

Cluso99 wrote: »

IIRC from Chip & Beau's die sizes, the AUX consumes 0.2 mm² and the COG (RAMs) 0.6 mm². AUX is 1/2 COG size, so AUX would consume 0.4 vs. 0.6 for equal sizes. But we are adding flops, which IIRC Chip says use a lot more space than the dedicated memory that he & Beau built. Using that equation, if AUX were converted to COG, then 256 longs would use 0.3 instead of 0.2 (times 8 for 8 cogs).

Sounds about right, although fitting the 2D layout might be an additional issue.

With the new 16 bit PC fields, a contiguous address space could be used, where just the hub $00000-00400 (mostly ROM) would not be mapped.

Such direct addressing mode can only be done as a single operand with 32 bit instructions. No more D & S in one 32 bit instruction. We're talking serious redesign.

Using a few tricks, such as "BIG" and the setting of an AUX bit, we could remap the AUX into Cog simply, and then use it for D & S values, such that all the normal and very useful instructions like AND/XOR/etc would work on the AUX as well. Probably we would not then require all the extra AUX access instructions.

Similar story here: not only are there instruction-encoding reasons for the memory separations but, if I'm not mistaken, AuxRAM is intended to be used concurrently with CogRAM without needing a fifth port in a larger CogRAM.

EDIT: And parallel operations per clock are a key part of the multiple memories. Each port has its own address and data bus pair. This means each port of each bank of your multi-banked CogRAM would need its own independently muxable address and data buses.

David Betz · 2013-12-14 03:26

evanh wrote: »

Sounds about right, although fitting the 2D layout might be an additional issue.

Such direct addressing mode can only be done as a single operand with 32 bit instructions. No more D & S in one 32 bit instruction. We're talking serious redesign.

Similar story here: not only are there instruction-encoding reasons for the memory separations but, if I'm not mistaken, AuxRAM is intended to be used concurrently with CogRAM without needing a fifth port in a larger CogRAM.

EDIT: And parallel operations per clock are a key part of the multiple memories. Each port has its own address and data bus pair. This means each port of each bank of your multi-banked CogRAM would need its own independently muxable address and data buses.

I agree. I don't think it would make much sense to try to merge AUX and COG at this late date. The only point I was trying to make is that I'd rather not add yet another hidden memory with special instructions for accessing it. I'd prefer the original LR idea where the return address is stored in a register that is in the COG address space.

Bill Henning · 2013-12-14 06:33

Interesting idea, however I think those flops would give much better performance if used to have a larger WIDE cache, or bigger cache for RDxxxxC.

If you can fit those flops, having a separate 4x8xlong wide cache for each task, with a separate 1x8xlong cache for each task would be a significant performance boost.

I think instead of LR, leaf functions can be called with the CALLX, using only one long of AUX, and avoiding the hub hit just like LR would. Simple.

cgracey wrote: »

What if we made an 8-level FIFO stack that had special instructions simply called CALL/CALLD and RET/RETD, and those would replace use of JMPRET/JMPRETD. JMPRET/JMPRETD could then just be used for register-based thread loops that track Z/C/PC. Would we miss anything? We wouldn't need to mess with subroutine_RET labels, anymore.

Edit: We'd need one of these for each task. 8 cogs x 4 stacks x 8 levels x 19 bits (hub/cog mode, Z, C, PC[15:0]) = 4,864 flops.

David Betz · 2013-12-14 06:39

Bill Henning wrote: »

I think instead of LR, leaf functions can be called with the CALLX, using only one long of AUX, and avoiding the hub hit just like LR would. Simple.

You would also waste PTRX which might be a bigger deal than wasting a location in AUX memory. In addition, you have to pop the return address from the AUX stack before you can add it to a hub-based stack frame. Remember, GCC may not know whether a function is a leaf function at the call site if the function is in a separately compiled module so it might not be possible to select the appropriate CALLx instruction except at link time and maybe not at all if the call is through a pointer. It's probably best to have the compiler generate the same call instruction for all functions whether they are leaf functions or not. The only exception to that is probably a static function where the compiler can find all calls to it.

cgracey · 2013-12-14 06:40

Bill Henning wrote: »

Interesting idea, however I think those flops would give much better performance if used to have a larger WIDE cache, or bigger cache for RDxxxxC.

If you can fit those flops, having a separate 4x8xlong wide cache for each task, with a separate 1x8xlong cache for each task would be a significant performance boost.

I think instead of LR, leaf functions can be called with the CALLX, using only one long of AUX, and avoiding the hub hit just like LR would. Simple.

I just added a 4-level 32-bit-wide LIFO stack to each task. It's accessible via PUSH/POP and CALL/CALLD/RET/RETD. All operations take 1 clock. This is good for calls within the cog RAM, or shallow calls and quick parameter passing during hub execution.

JMPRET/JMPRETD has been changed to JMPSW/JMPSWD for jump-switch. JMPSW D,S/# jumps to S/# (while setting hub mode from S[18] and possibly restoring Z/C from S[17]/S[16] via WZ/WC) and stores {13'b0, hubmode, Z, C, PC[15:0]} into D. It can be used with 'INDA,++INDA' to make round-robin cooperative threads.

All calls now store {13'b0, hubmode, Z, C, PC[15:0]} and all returns restore them (WZ/WC are optional).

When we synthesize the logic block, if we have room, I'll increase these stack depths from 4 to 8. I think 4 is actually quite adequate for internal cog use, and every task has a set. The old 'CALL #label' and 'label_ret RET' pairings are history. This makes code cleaner to write and read, plus routines become reentrant, which isn't so practical for recursion, but it means tasks can independently call the same routine now without return-addresses getting over-written.
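The stored word and the per-task stack described above can be modeled in a few lines of Python. This is a behavioral sketch, not the actual logic: the bit positions follow the post (hub mode at bit 18, Z at 17, C at 16, PC in the low 16 bits), but the function and class names are invented here, and the overflow/underflow behavior (raising an error) is purely an assumption, since the thread does not say what the hardware does.

```python
# Model of the 32-bit word that calls store: {13'b0, hubmode, Z, C, PC[15:0]}.
HUB_BIT, Z_BIT, C_BIT = 18, 17, 16
PC_MASK = 0xFFFF


def pack_return_word(hubmode, z, c, pc):
    """Pack hub mode, flags and PC into one 32-bit word (upper 13 bits zero)."""
    return (int(hubmode) << HUB_BIT) | (int(z) << Z_BIT) | (int(c) << C_BIT) | (pc & PC_MASK)


def unpack_return_word(word):
    """Recover (hubmode, z, c, pc) from a packed word."""
    return (
        bool((word >> HUB_BIT) & 1),
        bool((word >> Z_BIT) & 1),
        bool((word >> C_BIT) & 1),
        word & PC_MASK,
    )


class TaskLifo:
    """Fixed-depth LIFO. Raising on overflow/underflow is an assumption."""

    def __init__(self, depth=4):
        self.depth = depth
        self.items = []

    def push(self, word):
        if len(self.items) >= self.depth:
            raise OverflowError("hardware stack full")
        self.items.append(word & 0xFFFFFFFF)

    def pop(self):
        if not self.items:
            raise IndexError("hardware stack empty")
        return self.items.pop()


# One stack per task, so tasks can call the same routine independently.
stacks = [TaskLifo() for _ in range(4)]
stacks[0].push(pack_return_word(True, False, True, 0x1234))
stacks[1].push(pack_return_word(False, True, False, 0x0040))
task0 = unpack_return_word(stacks[0].pop())
task1 = unpack_return_word(stacks[1].pop())
print(task0, task1)
```

Because each task owns its stack, the two return words never collide, which is the reentrancy point made in the post.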

David Betz · 2013-12-14 06:42

cgracey wrote: »

I just added a 4-level 32-bit-wide FIFO stack to each task. It's accessible via PUSH/POP and CALL/CALLD/RET/RETD. All operations take 1 clock. This is good for calls within the cog RAM, or short calls and quick parameter passing during hub execution.

JMPRET/JMPRETD has been changed to JMPSW/JMPSWD for jump-switch. JMPSW D,S/# jumps to S/# and stores {13'b0, hubmode, z, c, pc[15:0]} into D. It can be used with INDA,++INDA to make round-robin threads.

Does this mean there will be no CALL instruction that saves its return address in LR?

Bill Henning · 2013-12-14 06:51

Looks good - and I agree with you, for internal use (drivers, leaf functions), the 4 levels should be good.

Instead of increasing this fifo's size for each stack, we'd get more performance if the WIDE cache was increased.

Good solution, does not use PTRX for just saving a single address!

cgracey wrote: »

I just added a 4-level 32-bit-wide FIFO stack to each task. It's accessible via PUSH/POP and CALL/CALLD/RET/RETD. All operations take 1 clock. This is good for calls within the cog RAM, or shallow calls and quick parameter passing during hub execution.

JMPRET/JMPRETD has been changed to JMPSW/JMPSWD for jump-switch. JMPSW D,S/# jumps to S/# (while setting hub mode from S[18] and possibly restoring Z/C from S[17]/S[16] via WZ/WC) and stores {13'b0, hubmode, z, c, pc[15:0]} into D. It can be used with INDA,++INDA to make round-robin threads.

All calls now store {13'b0,hubmode,Z,C,PC[15:0]} and all returns restore them (WZ/WC are optional).

When we synthesize the logic block, if we have room, I'll increase these stack depths from 4 to 8. I think 4 is actually quite adequate for internal cog use, and every task has a set.

cgracey · 2013-12-14 07:01

David Betz wrote: »

Does this mean there will be no CALL instruction that saves its return address in LR?

Is it a big loss if you have to do 'POP reg' to get the address?

It seems to me that if you want to do a quick table operation, use CALL/POP which uses the LIFO. It's fast.

David Betz · 2013-12-14 07:40

cgracey wrote: »

Is it a big loss if you have to do 'POP reg' to get the address?

It seems to me that if you want to do a quick table operation, use CALL/POP which uses the FIFO. It's fast.

I suppose it isn't a huge loss but it is more awkward.

cgracey · 2013-12-14 08:11

Bill Henning wrote: »

Instead of increasing this fifo's size for each stack, we'd get more performance if the WIDE cache was increased.

You mean for RDBYTEC/RDWORDC/RDLONGC, right? And you're talking about more 8-long cache lines than just one?

cgracey · 2013-12-14 08:12

David Betz wrote: »

I suppose it isn't a huge loss but it is more awkward.

It means one more instruction in your dispatcher, right?

jazzed · 2013-12-14 08:18

Chip,

LR is used everywhere in the GCC code. It is part of Eric's machine register spec. Not having LR slows things down.

That being said, I've been advised to encourage you to finish your changes. So, maybe you can wrap it up, LR or not. I'll miss not having a SERDES.

cgracey · 2013-12-14 08:48

jazzed wrote: »

Chip,

LR is used everywhere in the GCC code. It is part of Eric's machine register spec. Not having LR slows things down.

That being said, I've been advised to encourage you to finish your changes. So, maybe you can wrap it up, LR or not. I'll miss not having a SERDES.

I'll see about writing the return address into $1F1. And it needs to be written, not just windowed, because you guys want to perform subsequent operations on it.

I'll have to force $1F1 into the D address at stage 4 and force a write when a CALL/CALLA/CALLB/CALLX/CALLY comes through. And I'll need to get the return address onto the result bus. It might be a simple thing to do and would be useful for lots of things. I'll try to do it today.

cgracey · 2013-12-14 08:52

David Betz wrote: »

I guess so but it would make the table dispatch idea take an extra instruction.

With LR:

    mov r0, #2 ' or some code to compute a case value
    call_lr #dispatch
    long case0
    long case1
    long case2
    long case3
    ....
    long caseN

dispatch
    add lr, r0
    jmp lr

With stack:

    mov r0, #2 ' or some code to compute a case value
    call_lr #dispatch
    long case0
    long case1
    long case2
    long case3
    ....
    long caseN

dispatch
    pop lr
    add lr, r0
    jmp lr

Edit: Actually, I suppose you would never do things like this anyway. Maybe it was a dumb idea! :-(

I have a question: Is this #dispatch routine resident in the cog or the hub memory? I assume the caller is in the hub.

I ask because, if it's in the cog, the JMPSW instruction will save the return address to any register and jump to #dispatch.

If #dispatch is in the hub, you need the LR.

Thinking more about this, just writing to $1F1 wouldn't be adequate for more than one hub task. The new CALL/POP combo would work better because it pulls from the task's own LIFO stack.

David Betz · 2013-12-14 09:08

cgracey wrote: »

I have a question: Is this #dispatch routine resident in the cog or the hub memory? I assume the caller is in the hub.

I ask because, if it's in the cog, the JMPSW instruction will save the return address to any register and jump to #dispatch.

If #dispatch is in the hub, you need the LR.

Thinking more about this, just writing to $1F1 wouldn't be adequate for more than one hub task. The new CALL/POP combo would work better because it pulls from the task's own FIFO stack.

Sorry, I have a concert today and lots of errands to run before then. I guess I shouldn't be providing input when I don't really have time to think this through. Don't take my dispatch example too seriously. I just made it up as an example of how it would be nice to have the return address in a COG register. It isn't actual code generated by propgcc. If you need to decide this today or this weekend I guess you'll have to just do what you think is best and we'll find a way to make use of it in propgcc. My guess is that, if there is a return FIFO for the CALL instruction, every non-leaf C function will immediately pop the return address off that stack and push it onto the hub stack adding one extra instruction to the function prologue. Leaf functions could leave the return address on the stack. Either that or we'll end up using the hub stack call functions for everything. There is a big advantage to having a standard calling convention for all C functions so we can't easily use one call instruction for leaf functions and a different one for non-leaf functions. I hope Eric will correct me if I'm wrong here.

Also, I am aware that there would need to be four copies of the LR register. Maybe that's a deal breaker.

Sorry I can't be more responsive this weekend!
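The calling convention guessed at above (every non-leaf function moves its return address from the hardware stack to the hub stack in its prologue, while leaf functions leave it in place) can be sketched as follows. This is a toy model under stated assumptions, not propgcc output: the helper names are hypothetical, the hub stack is an unbounded Python list, and the 4-level depth is the figure from the thread.

```python
HW_DEPTH = 4  # depth of the per-task hardware stack mentioned in the thread

hw_stack = []   # per-task hardware LIFO (return addresses pushed by CALL)
hub_stack = []  # software stack in hub RAM (unbounded here for simplicity)


def call(return_addr):
    """Hypothetical CALL: push the return address on the hardware LIFO."""
    if len(hw_stack) >= HW_DEPTH:
        raise OverflowError("hardware stack overflow")
    hw_stack.append(return_addr)


def nonleaf_prologue():
    """The one extra instruction: move the return address to the hub stack."""
    hub_stack.append(hw_stack.pop())


def nonleaf_epilogue():
    """Return using the address saved on the hub stack."""
    return hub_stack.pop()


def leaf_return():
    """Leaf functions return straight from the hardware LIFO."""
    return hw_stack.pop()


# Nest 10 non-leaf calls (deeper than the hardware stack), then call a leaf.
max_hw_depth = 0
for depth in range(10):
    call(0x100 + depth)
    max_hw_depth = max(max_hw_depth, len(hw_stack))
    nonleaf_prologue()
call(0x200)  # call into a leaf function
max_hw_depth = max(max_hw_depth, len(hw_stack))

returns = [leaf_return()] + [nonleaf_epilogue() for _ in range(10)]
print(max_hw_depth, [hex(r) for r in returns[:3]])
```

In this model the hardware stack never holds more than one entry, so its depth stops limiting call nesting, at the cost of one extra instruction per non-leaf prologue.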

tonyp12 · 2013-12-14 09:14

So many tricks and workarounds are needed due to the 32-bit limit.
Is the next step to 64-bit really needed? 36-bit or 48-bit, though it sounds weird, is just enough to clean up the instruction ops and give 1024 cog longs and also 10-12 bit direct values, etc.

http://en.wikipedia.org/wiki/36-bit

cgracey · 2013-12-14 09:17

David Betz wrote: »

Sorry, I have a concert today and lots of errands to run before then. I guess I shouldn't be providing input when I don't really have time to think this through. Don't take my dispatch example too seriously. I just made it up as an example of how it would be nice to have the return address in a COG register. It isn't actual code generated by propgcc. If you need to decide this today or this weekend I guess you'll have to just do what you think is best and we'll find a way to make use of it in propgcc. My guess is that, if there is a return FIFO for the CALL instruction, every non-leaf C function will immediately pop the return address off that stack and push it onto the hub stack adding one extra instruction to the function prologue. Leaf functions could leave the return address on the stack. Either that or we'll end up using the hub stack call functions for everything. There is a big advantage to having a standard calling convention for all C functions so we can't easily use one call instruction for leaf functions and a different one for non-leaf functions. I hope Eric will correct me if I'm wrong here.

Also, I am aware that there would need to be four copies of the LR register. Maybe that's a deal breaker.

Sorry I can't be more responsive this weekend!

No problem, Dave. I appreciate your help with this, even if it's sporadic.

Right now, I'm thinking that the LIFO stacks solve this problem across all the tasks, already, and we may not be able to make it work any simpler.

One question: Would there be any value in a 'PEEK D' instruction that returns the last-written LIFO stack value without popping it? That would be mindless to add.

Anyway, have a great weekend.

David Betz · 2013-12-14 09:28

cgracey wrote: »

No problem, Dave. I appreciate your help with this, even if it's sporadic.

Right now, I'm thinking that the FIFO stacks solve this problem across all the tasks, already, and we may not be able to make it work any simpler.

One question: Would there be any value in a 'PEEK D' instruction that returns the last-written FIFO stack value without popping it? That would be mindless to add.

Anyway, have a great weekend.

I don't see any real value in a PEEK instruction. If you can't have a version of CALL that puts its return address in a COG register then I guess it's best to just do what you're already planning. By the way, my original proposal for LR included a SETLR instruction to set the COG register that would be used as LR. This had the advantage that, if used in hardware tasking mode, you could make the LR register one of the set of registers that gets remapped differently for each task. That means you wouldn't need a separate copy of LR for each task. The register remapping would handle that.
