Hub Execution Model Thread (split from blog)

Bill Henning · 2013-12-08 08:39

David Betz wrote: »

My understanding is that BIG will extend any S field so making it not work with DJNZ would require a special case. Working with DJNZ is the default.

Probably true, can't wait to try it on the FPGA.

David Betz wrote: »

I don't object to you making comments about what would be easy or hard for you to implement in a code generator. I object to specific references to GCC since I am not aware that you have any experience with it. If you notice, I don't usually make definitive statements about how hard things would be with GCC but usually defer to Eric who has done most of the code generator work. I sometimes speculate as to what I think might be easy or hard but I don't feel like I'm in a position to make definite statements and I have worked in the PropGCC code generator at least a little. Like you, I've also worked on numerous compilers over the years so I have at least a basic understanding of how they work.

With all due respect, I can and will say what I feel is relevant.

You felt free to make suggestions on how to implement various functionality (aka register remapping, pc, ptra/ptrb) where the suggestions would not readily (IMHO) fit the hardware, and while I disagreed with your suggestion, I support your right to comment on it fully - therefore please respect my right to say what I please.

jazzed · 2013-12-08 09:07

Come on guys. Intolerance is so intolerable

David Betz · 2013-12-08 10:01

jazzed wrote: »

Come on guys. Intolerance is so intolerable

The problem is that statements like "this will be easy to do in GCC" have the effect of cutting off discussion about whether it is even true. There are many clever schemes that will be very useful to assembly language programmers but not so useful for compiled languages. If someone who commands respect like Bill comes in and says "it's fine for GCC" when that may not be true then we run the risk of ending up with a design that is again great for assembly programmers and VM implementers but not so good for native code compilers. Also, there are ways to make a compiler take advantage of clever instructions but to do that may involve more work on the code generator side than Parallax is willing to spend. I guess we need to decide if our goal is to make the absolute fastest chip for PASM programmers at the expense of the regularity that will make it a good target for native code compilers.

Bill Henning · 2013-12-08 10:17

You proposed:

  BIG #hubaddr>>9
  DJNZ  count,#hubaddr & $1ff

I proposed

  SUB count,#1
if_nz JMP #hubaddr

I said my example was easier for GCC, which got your knickers in a twist, and you tried to tell me not to comment on what is easy/not in gcc.

Pseudo-code for emitting gas for both cases, printf style; exact syntax does not matter for discussion:

emit("BIG #%s>>9\n",bigaddr_symbol_name);
emit("djnz %s,#%s & $1ff\n",counter_name, symbol_name);

Other case

emit("SUB %s,#1\n",counter_name);
emit("if_nz hjmp #%s\n",symbol_name);

The difference is trivial, but somewhat simpler in the case I showed.

David Betz wrote: »

The problem is that statements like "this will be easy to do in GCC" have the effect of cutting off discussion about whether it is even true. There are many clever schemes that will be very useful to assembly language programmers but not so useful for compiled languages. If someone who commands respect like Bill comes in and says "it's fine for GCC" when that may not be true then we run the risk of ending up with a design that is again great for assembly programmers and VM implementers but not so good for native code compilers. Also, there are ways to make a compiler take advantage of clever instructions but to do that may involve more work on the code generator side than Parallax is willing to spend. I guess we need to decide if our goal is to make the absolute fastest chip for PASM programmers at the expense of the regularity that will make it a good target for native code compilers.

David Betz · 2013-12-08 10:20

Bill Henning wrote: »
You proposed:
  BIG #hubaddr>>9
  DJNZ  count,#hubaddr & $1ff
I proposed
  SUB count,#1
if_nz JMP #hubaddr
I said my example was easier for GCC, which got your knickers in a twist, and you tried to tell me not to comment on what is easy/not in gcc.

Pseudo-code for emitting gas for both cases, printf style; exact syntax does not matter for discussion:

emit("BIG #%s>>9\n",bigaddr_symbol_name);
emit("djnz %s,#%s & $1ff\n",counter_name, symbol_name);

Other case

emit("SUB %s,#1\n",counter_name);
emit("if_nz hjmp #%s\n",symbol_name);

The difference is trivial, but somewhat simpler in the case I showed.

If you had read any of my comments about how BIG would be used, you'd know that I don't expect anyone to actually write a BIG instruction. It will be handled entirely automatically by the assembler possibly with a cue from the programmer like ## as suggested by some here. GCC will simply generate a single instruction with a larger than 9 bit immediate operand. So, generating a single instruction is easier than generating a pair of instructions. However, this discussion is pretty much irrelevant because neither is really that difficult.

So for your example:

  DJNZ count, #hubaddr

or maybe

  DJNZ count, ##hubaddr
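
A minimal sketch of the assembler-side expansion described above, assuming a 9-bit S field and a BIG prefix that carries the remaining high bits (the split mirrors the `>>9` and `& $1ff` in the hand-written version). The function name `expand_immediate` and the output formatting are hypothetical, not part of any real assembler:

```python
S_FIELD_BITS = 9                          # width of the immediate S field
S_FIELD_MASK = (1 << S_FIELD_BITS) - 1    # $1FF


def expand_immediate(mnemonic, dest, value):
    """Return the assembly lines needed for `mnemonic dest, #value`.

    Small immediates fit in the S field directly. Larger ones get a BIG
    prefix holding the high bits, as in the thread's hand-written example.
    """
    if value < 0:
        raise ValueError("this sketch only handles non-negative immediates")
    if value <= S_FIELD_MASK:
        return [f"{mnemonic} {dest}, #{value}"]
    high = value >> S_FIELD_BITS          # goes into the BIG prefix
    low = value & S_FIELD_MASK            # stays in the instruction itself
    return [f"BIG #{high}", f"{mnemonic} {dest}, #{low}"]


if __name__ == "__main__":
    for line in expand_immediate("DJNZ", "count", 0x1234):
        print(line)
```

With this shape the compiler only ever emits the single `DJNZ count, #hubaddr` form, and the decision to add a prefix is made in one place.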

David Betz · 2013-12-08 11:28

I wonder how Chip is doing with hub execute mode? I'm very interested to see what he comes up with. I'm kind of hoping that he sees a better way to do this than any of us have come up with here so far. It's hard for any of us to guess what will work well with the COG Verilog. Could be that things we think would be difficult will actually be easy and vice versa. Anyway, Chip tends to find clever solutions I would never have thought of. It should be interesting!

potatohead · 2013-12-08 11:34

Me too.

This problem morphed into basically hardware LMM. As originally discussed, it's more complicated than it appears.

jazzed · 2013-12-08 12:48

potatohead wrote: »

Me too.

This problem morphed into basically hardware LMM. As originally discussed, it's more complicated than it appears.

The original idea was hardware LMM. That got morphed into something else

David Betz · 2013-12-08 12:50

jazzed wrote: »

The original idea was hardware LMM. That got morphed into something else

I'm not sure that's true. I remember asking Chip for the ability to execute code directly from hub memory by extending the PC with enough bits to address all of hub memory back when he first announced RDLONGC. I suspect the idea came up when he and Bill first discussed RDLONGC as well. As I remember it, the idea for RDLONGC came from a conversation that Bill and Chip had at one of the Propeller Expos.

potatohead · 2013-12-08 12:54

Yep. I think it really is a move from something to hardware LMM. It seemed that just executing from the HUB could make sense, and from there, it moved rapidly to being able to run larger programs, and why have LMM?

jazzed · 2013-12-08 14:15

David Betz wrote: »

I'm not sure that's true. I remember asking Chip for the ability to execute code directly from hub memory by extending the PC with enough bits to address all of hub memory back when he first announced RDLONGC. I suspect the idea came up when he and Bill first discussed RDLONGC as well. As I remember it, the idea for RDLONGC came from a conversation that Bill and Chip had at one of the Propeller Expos.

Ugh.

Sorry, but I never heard of that. I watched Bill wave his hands and bend Chip's ear for hours about LMM RDLONGC though at UPEXPO.

I was referring to what I wrote in post 3202

jazzed wrote: »

...

However it's still not really useful for large code performance (I.E. programs that won't fit in a COG). A HUB hogging COG will not be of much extra value because instructions need to be interpreted anyway. In the case of higher code density, all instructions need to be interpreted. In the case of using 32 bit instructions, we still need to interpret "LMM jumps", etc.... If all instructions could simply be fetched and executed from HUB without interpretation, the proposition would be more encouraging. Many instructions can be fetched and executed as is, but there are several that must be interpreted like a long jump that requires extra instructions maintaining the pointers, etc....

To me the idea was for hardware to fetch and execute hub instructions without needing the COG to have code in it. That is, the LMM interpreter hack, necessary in P1 because it was an afterthought, would not be necessary in this new chip.

The mechanics of it are fairly simple: Start the hub exec engine with a special cognew, and the hardware starts at the PC with stack and heap parameters all passed in PAR.

The hub fetch and exec engine sets PC based on the instructions and increments PC for us without requiring COG instructions (HARDWARE LMM). The COG could keep registers, etc.. though. When it needs more instructions, it gets enough to fill the instruction cache so that waiting for the next HUB fetch cycle is not necessary.

There would not be any time wasted by software needing to fetch more instructions - the hardware would do that. HUB mode instructions handle jumps and calls for LMM-like macros. In the simplest and most general mode, stack would live in HUB, and data would live in HUB. That should be the basic proof of concept.

Being able to load/store data with BIG is a great, simple and necessary optimization obviously gleaned from experience with LMM.

Anything else that is critical can be gleaned from existing LMM machines (as mentioned here). We build on what we know works!

In performance optimizations where recursion is not allowed stack might live in the AUX thing (but that is an optimization!). Tasks and shared slots are also optimizations not needed for proof of concept.

That's what I meant by saying it started as a hardware LMM. Everything else added is cake icing.

dMajo · 2013-12-09 01:57

cgracey wrote: »

I'm going to put a 4-line (x8 long) instruction cache into each cog. Running one hub task would work well. Running more than one task would thrash the cache. There will be a 1-line (x8 long) data cache in each cog for RDxxxxC. Along with Z/C/PC for each task, there will be a bit signifying whether hub mode is active. In hub mode, the conditional branches (DJNZ, JP, TJZ) will probably become bit8-extended relative branches.

I'm going to sleep. When I come back, I'll increase the program counters to 16 bits and make sure things still run. Then, I'll add the instruction cache.

If the amount of cache lines and a suitable algorithm is a concern in terms of used silicon can't you simply use cog registers for cache?

Can't you force registers from top to bottom as task1 to task4 caches? Supposing 2 cache lines per task, that is 2*4(*8)=64 longs; 4 cache lines per task is 128 longs out of 512. If only one task is used in HUBEXEC this is 16 longs. I think that having one or more tasks (depending on user needs) that each steal 16 longs from the internal cog is an acceptable restriction. I think that, given 1 task is executing from hub, the others can live with 16 longs less.
This will reduce the cog memory available for local subroutines/code, but perhaps when running from hub it will not require so much local storage. Doing it this way, you have all the needed cache memory (with the right number of ports and bus size) already in place, and you can afford a little more silicon to enforce some rules and for a good cache-managing algorithm.
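
The arithmetic above can be checked with a short sketch. It assumes 8 longs per cache line and 512 longs of cog RAM, as stated in the thread; the function names are made up for illustration:

```python
LONGS_PER_LINE = 8    # one cache line, per Chip's description
COG_LONGS = 512       # total cog register space


def cache_longs(lines_per_task, hub_tasks):
    """Cog longs given up when each hub-executing task gets its own lines."""
    return lines_per_task * hub_tasks * LONGS_PER_LINE


def remaining_longs(lines_per_task, hub_tasks):
    """Cog longs still available for registers and local code."""
    return COG_LONGS - cache_longs(lines_per_task, hub_tasks)


if __name__ == "__main__":
    for lines, tasks in [(2, 4), (4, 4), (2, 1)]:
        used = cache_longs(lines, tasks)
        print(f"{lines} lines x {tasks} tasks: {used} longs used, "
              f"{remaining_longs(lines, tasks)} left")
```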

David Betz · 2013-12-09 03:58

jazzed wrote: »

Ugh.

Sorry, but I never heard of that. I watched Bill wave his hands and bend Chip's ear for hours about LMM RDLONGC though at UPEXPO.

I was referring to what I wrote in post 3202

To me the idea was for hardware to fetch and execute hub instructions without needing the COG to have code in it. That is, the LMM interpreter hack, necessary in P1 because it was an afterthought, would not be necessary in this new chip.

The mechanics of it are fairly simple: Start the hub exec engine with a special cognew, and the hardware starts at the PC with stack and heap parameters all passed in PAR.

The hub fetch and exec engine sets PC based on the instructions and increments PC for us without requiring COG instructions (HARDWARE LMM). The COG could keep registers, etc.. though. When it needs more instructions, it gets enough to fill the instruction cache so that waiting for the next HUB fetch cycle is not necessary.

There would not be any time wasted by software needing to fetch more instructions - the hardware would do that. HUB mode instructions handle jumps and calls for LMM-like macros. In the simplest and most general mode, stack would live in HUB, and data would live in HUB. That should be the basic proof of concept.

Being able to load/store data with BIG is a great, simple and necessary optimization obviously gleaned from experience with LMM.

Anything else that is critical can be gleaned from existing LMM machines (as mentioned here). We build on what we know works!

In performance optimizations where recursion is not allowed stack might live in the AUX thing (but that is an optimization!). Tasks and shared slots are also optimizations not needed for proof of concept.

That's what I meant by saying it started as a hardware LMM. Everything else added is cake icing.

I guess I'm missing something. It sounds like you're describing the same mechanism we've been discussing: the COG fetches instructions directly from the hub, using something similar to how RDLONGC works to avoid having to go to the hub for every instruction. New instructions (my LCALLx or Bill's HCALLx) are added to replace the LMM jump macros. BIG is used for 32 bit constants at a minimum and probably also for direct access to hub locations although the PTRx registers are probably better at that. In what way is your model different? I think the only complication that has been introduced recently is the idea of a 4 line cache and that was Chip's suggestion. I wouldn't have asked for that because I figured it was too much to hope for but it will certainly improve performance.

David Betz · 2013-12-09 04:04

dMajo wrote: »

If the amount of cache lines and a suitable algorithm is a concern in terms of used silicon can't you simply use cog registers for cache?

Can't you force registers from top to bottom as task1 to task4 caches? Supposing 2 cache lines per task, that is 2*4(*8)=64 longs; 4 cache lines per task is 128 longs out of 512. If only one task is used in HUBEXEC this is 16 longs. I think that having one or more tasks (depending on user needs) that each steal 16 longs from the internal cog is an acceptable restriction. I think that, given 1 task is executing from hub, the others can live with 16 longs less.
This will reduce the cog memory available for local subroutines/code, but perhaps when running from hub it will not require so much local storage. Doing it this way, you have all the needed cache memory (with the right number of ports and bus size) already in place, and you can afford a little more silicon to enforce some rules and for a good cache-managing algorithm.

When I proposed the BIG instruction it was based on how an old processor that Eric and I worked on handled big constants. The idea of sharing local memory between direct access and cache was also a feature of that processor. The cores that ran C had larger local memories. I think at least one had up to 20k of memory local to the core. Some others that didn't have cache logic had smaller amounts as low as 4K. However, the cached cores had the ability to partition their local memory between true local memory where performance was deterministic and cache memory for access to the huge off-chip memory that was accessible on the two DMA busses. Partitioning the COG into cache as well as local memory would be similar. The only problem I can see with this approach on the COG is that the local memory is so small. You'd still need some local memory for registers if nothing else so you wouldn't even be able to use the entire 2K of COG memory for cache.

cgracey · 2013-12-09 04:55

Bill Henning wrote: »

You say many things I don't like, I say things you don't like. Discussion ensues, often better results are obtained due to discussion.

Good point, Bill.

If we can put our pride aside, or at least low-pass filter it, we can see every day that things are moving in a good direction. This has been the consistent pattern here.

I am really grateful for everybody's involvement and interest in this project. You have all made the Prop2 into something way better than what I could have come up with on my own. I really love implementing all these great ideas we discuss into a harmonious design. This is a lot of fun! I'm sure that no other microcontroller has benefited from such broad input from so many varied experts before.

cgracey · 2013-12-09 05:06

dMajo wrote: »

If the amount of cache lines and a suitable algorithm is a concern in terms of used silicon can't you simply use cog registers for cache?

The problem is that cog RAM has one 32-bit write port that is used potentially every clock cycle by instructions. The cache lines can capture 8 longs per hub cycle without burdening any other resource.
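
To get a feel for why one hub task suits a 4-line, 8-long instruction cache while several tasks would thrash it, here is a toy model. It assumes least-recently-used replacement and strict round-robin task interleaving; neither is confirmed for the real design, and all names are illustrative:

```python
from collections import OrderedDict

LINES = 4             # cache lines, per the proposal quoted above
LONGS_PER_LINE = 8    # longs captured per hub cycle


def count_misses(addresses, lines=LINES):
    """Count line fills for a sequence of long addresses (LRU is assumed)."""
    cache = OrderedDict()                  # line tag -> None, oldest first
    misses = 0
    for addr in addresses:
        tag = addr // LONGS_PER_LINE
        if tag in cache:
            cache.move_to_end(tag)         # mark as most recently used
        else:
            misses += 1
            cache[tag] = None
            if len(cache) > lines:
                cache.popitem(last=False)  # evict the least recently used
    return misses


def loop_trace(base, length, iterations):
    """Addresses fetched by a task looping over `length` longs at `base`."""
    return [base + i for _ in range(iterations) for i in range(length)]


def interleave(traces):
    """Round-robin the fetches of several equal-length task traces."""
    return [addr for group in zip(*traces) for addr in group]


if __name__ == "__main__":
    one = loop_trace(0, 16, 10)
    four = interleave([loop_trace(b, 16, 10) for b in (0, 64, 128, 192)])
    print("one task  :", count_misses(one), "misses in", len(one), "fetches")
    print("four tasks:", count_misses(four), "misses in", len(four), "fetches")
```

In this model a single task looping over two lines only misses while warming up, whereas four tasks with the same loop shape miss on every line crossing because their combined working set is eight lines.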

ersmith · 2013-12-09 05:09

Bill Henning wrote: »

Pseudo-code for emitting gas for both cases, printf style; exact syntax does not matter for discussion:

emit("BIG #%s>>9\n",bigaddr_symbol_name);
emit("djnz %s,#%s & $1ff\n",counter_name, symbol_name);

Other case

emit("SUB %s,#1\n",counter_name);
emit("if_nz hjmp #%s\n",symbol_name);

That's not how GCC's code generator works. The machine code generation is based on a pattern matching algorithm. There are a number of predefined patterns that GCC can understand: see http://gcc.gnu.org/onlinedocs/gccint/Standard-Names.html#Standard-Names. One of these is "decrement_and_branch_until_zero". If an instruction pattern exists in the target with that name, GCC will use it for loops; otherwise it will attempt to synthesize it with other instructions (in this case sub and then conditional branch). Technically you could argue that DJNZ is actually easier for GCC to emit, but since the machine-independent code already knows how to synthesize it from other instructions it's a wash for us.

Incidentally the list of standard names is a good starting point for figuring out what hardware features GCC can "easily" be made to support.
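
As a rough illustration of the named-pattern idea, here is a toy selector. It is not GCC's actual machinery, which works on RTL templates in machine description files. Only `decrement_and_branch_until_zero` is a real standard name; `sub_imm`, `branch_nz`, and the template strings are hypothetical. The `wz` suffix, which the thread's shorthand omits, is what makes PASM update the Z flag:

```python
def select_loop_branch(target_patterns, counter, label):
    """Pick instructions for 'decrement counter, branch to label if not zero'.

    If the target defines the named pattern, use it directly. Otherwise
    fall back to a subtract followed by a conditional branch.
    """
    template = target_patterns.get("decrement_and_branch_until_zero")
    if template is not None:
        return [template.format(counter=counter, label=label)]
    return [
        target_patterns["sub_imm"].format(counter=counter, imm=1),
        target_patterns["branch_nz"].format(label=label),
    ]


# Hypothetical target descriptions, one with DJNZ and one without.
WITH_DJNZ = {
    "decrement_and_branch_until_zero": "djnz {counter}, #{label}",
    "sub_imm": "sub {counter}, #{imm} wz",
    "branch_nz": "if_nz jmp #{label}",
}
WITHOUT_DJNZ = {k: v for k, v in WITH_DJNZ.items()
                if k != "decrement_and_branch_until_zero"}

if __name__ == "__main__":
    print(select_loop_branch(WITH_DJNZ, "count", "loop"))
    print(select_loop_branch(WITHOUT_DJNZ, "count", "loop"))
```

Either way the caller asks for the same operation, which is why the choice between the two encodings is a wash for the compiler.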

cgracey · 2013-12-09 05:17

David Betz wrote: »

I wonder how Chip is doing with hub execute mode?

I think it's coming together really well now. I've got all the PC's expanded to 16 bits and made the AUX-based CALL/RET instructions store a 16-bit PC value, along with Z and C, plus one bit that holds hub vs cog mode. You'll be able to call from hub to cog and vice-versa, since the caller's mode will be saved on CALL and restored on RET. I also used the BIG-constant idea to make an instruction which puts the PC value into the BIG buffer so that relative lookups can be done for branch tables, etc. There are some details I still need to work out before I get to implementing the actual cache lines, but I feel like it's a very natural extension of the architecture and not, at all, some last-minute kludge. Someone looking at it for the first time will think it was all designed from inception to work this way.
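
A minimal sketch of what a saved return word could look like with a 16-bit PC, Z, C, and a hub/cog mode bit. The bit positions are an assumption made for illustration; the post does not give the actual layout:

```python
PC_MASK = 0xFFFF      # 16-bit program counter
Z_BIT = 16            # assumed position of the Z flag
C_BIT = 17            # assumed position of the C flag
HUB_BIT = 18          # assumed position of the hub-vs-cog mode bit


def pack_return(pc, z, c, hub_mode):
    """Pack the caller's state into one word, as CALL would save it."""
    if not 0 <= pc <= PC_MASK:
        raise ValueError("pc must fit in 16 bits")
    return (pc
            | (int(z) << Z_BIT)
            | (int(c) << C_BIT)
            | (int(hub_mode) << HUB_BIT))


def unpack_return(word):
    """Recover (pc, z, c, hub_mode), as RET would restore it."""
    return (word & PC_MASK,
            bool((word >> Z_BIT) & 1),
            bool((word >> C_BIT) & 1),
            bool((word >> HUB_BIT) & 1))


if __name__ == "__main__":
    saved = pack_return(0x1234, z=True, c=False, hub_mode=True)
    print(hex(saved), unpack_return(saved))
```

Because the mode bit travels with the return address, a RET lands back in whichever mode the caller was running in, which is what allows calls across the hub/cog boundary.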

David Betz · 2013-12-09 06:41

cgracey wrote: »

I think it's coming together really well now. I've got all the PC's expanded to 16 bits and made the AUX-based CALL/RET instructions store a 16-bit PC value, along with Z and C, plus one bit that holds hub vs cog mode. You'll be able to call from hub to cog and vice-versa, since the caller's mode will be saved on CALL and restored on RET. I also used the BIG-constant idea to make an instruction which puts the PC value into the BIG buffer so that relative lookups can be done for branch tables, etc. There are some details I still need to work out before I get to implementing the actual cache lines, but I feel like it's a very natural extension of the architecture and not, at all, some last-minute kludge. Someone looking at it for the first time will think it was all designed from inception to work this way.

That's great news! Thanks for the update!

jazzed · 2013-12-09 08:27

David Betz wrote: »

I guess I'm missing something.

The first paragraph I quoted was written before this thread was started. And no, I'm not insulted.

I haven't had much to say because I thought everything was going rather swimmingly.

David Betz · 2013-12-09 08:31

jazzed wrote: »

The first paragraph I quoted was written before this thread was started. And no, I'm not insulted.

I haven't had much to say because I thought everything was going rather swimmingly.

I'll admit that I've not had time to read every post in detail. Things are moving really fast here! However, from Chip's reports it sounds like we will end up with something that is very nice. I suppose no one ever doubted that. :-)

Actually, I'm not really concerned with who said what when. I really just care that we end up with something that will work well and it looks like that will happen.

jazzed · 2013-12-09 08:33

cgracey wrote: »

I think it's coming together really well now. I've got all the PC's expanded to 16 bits and made the AUX-based CALL/RET instructions store a 16-bit PC value, along with Z and C, plus one bit that holds hub vs cog mode. You'll be able to call from hub to cog and vice-versa, since the caller's mode will be saved on CALL and restored on RET. I also used the BIG-constant idea to make an instruction which puts the PC value into the BIG buffer so that relative lookups can be done for branch tables, etc. There are some details I still need to work out before I get to implementing the actual cache lines, but I feel like it's a very natural extension of the architecture and not, at all, some last-minute kludge. Someone looking at it for the first time will think it was all designed from inception to work this way.

Sounds good, Chip. You do allow for using HUB as a stack too, right? That will be necessary for larger programs.

Glad it seems to be a natural fit. Having to shoe-horn something in that makes a mess wouldn't be very comforting.

potatohead · 2013-12-09 08:34

Me too. I think the thrashing about is good actually. Chip has consistently shown he can read through that, interact with us in whatever state we may be in at the time, and come away with the core ideas he needs to advance the design.

jazzed · 2013-12-09 08:41

David Betz wrote: »

I'll admit that I've not had time to read every post in detail. Things are moving really fast here! However, from Chip's reports it sounds like we will end up with something that is very nice. I suppose no one ever doubted that. :-)

Actually, I'm not really concerned with who said what when. I really just care that we end up with something that will work well and it looks like that will happen.

Yes, I'm thrilled that we will no longer need the LMM interpreter hack. And we won't have to answer embarrassing questions like "Interpreted? Well, why didn't you fix that?"

We won't have to waste memory on an interpreter anymore. We can still use an interpreter with Eric's compressed instruction set or Chip's spin byte-code for higher density.

I just hope the performance boost we are expecting is achievable.

David Betz · 2013-12-09 08:42

jazzed wrote: »

Sounds good, Chip. You do allow for using HUB as a stack too, right? That will be necessary for larger programs.

I know this is controversial but I believe that a hub stack will be required for all but the tiniest C programs (what we currently call COG mode programs). It isn't only a feature to be used by "larger programs". This is because, as has been stated many times here, the AUX stack doesn't really work if you need to take the address of local variables which is often required even in small C programs. This is where a CALL/RET pair that uses a register to hold the return address rather than the AUX stack would be helpful. It isn't essential of course since we could use the CALLA/CALLB version of the instruction (LCALLA or HCALLA or whatever) and then just pop the return address off the AUX stack and store it on the hub stack. It would be more efficient though to be able to just call a function and have the return address stored in a register if that ends up being possible:

    ' in setup code
    SETLR lr

    ' later in generated code
    LCALL_LR #my_fcn

     ' more code

my_fcn
    ' function body
    LRET_LR

The SETLR instruction would set the register to use as the link register. The LCALL_LR instruction would store the return address in that register and the LRET_LR instruction would return to the address stored in the link register. I'm not attached to these names by the way. In fact, I hate them and would welcome better suggestions! :-)

This maps directly to the way PropGCC currently handles calling functions.
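
The link-register convention can be modeled in a few lines. This is a sketch only: the method names echo the placeholder mnemonics above, the PC counts instructions rather than bytes, and the hub stack is a plain Python list:

```python
class Machine:
    """Toy model of call/return through a link register."""

    def __init__(self):
        self.pc = 0
        self.lr = None        # link register, as selected by SETLR
        self.hub_stack = []   # stands in for a stack kept in hub RAM

    def lcall_lr(self, target):
        self.lr = self.pc + 1     # return address lands in a register
        self.pc = target

    def lret_lr(self):
        self.pc = self.lr

    def push_lr(self):
        self.hub_stack.append(self.lr)   # only non-leaf functions need this

    def pop_lr(self):
        self.lr = self.hub_stack.pop()


if __name__ == "__main__":
    m = Machine()
    m.pc = 10
    m.lcall_lr(100)    # enter a non-leaf function at 100
    m.pc = 105         # pretend its body has run up to 105
    m.push_lr()        # save our own return address before calling out
    m.lcall_lr(200)    # call a leaf function
    m.lret_lr()        # leaf returns with no stack traffic at all
    m.pop_lr()
    m.lret_lr()
    print(m.pc)        # back at the instruction after the first call
```

A leaf function never touches memory for its return address, while a non-leaf function pays one save and one restore on the hub stack.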

Bill Henning · 2013-12-09 09:22

Steve,

Minor correction for the sake of technical accuracy.

LMM is a hack (in the non-derogatory meaning of the word), if a cute one, that was needed to make large model programs possible.

"LMM interpreter" is not correct. See http://en.wikipedia.org/wiki/Interpreter_%28computing%29

An interpreter requires... umm... interpreting, and would require a jump table with an entry for each instruction, and a routine to execute it.

Or a massive case statement, or a very very long if/else chain - none of which would fit in a cog.

The LMM fetch/execute loop does no interpreting whatsoever on the native instruction.

Like everyone else, I am delighted to have hubexec replace it.
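
For readers who have not seen it, the shape of an LMM loop can be modeled roughly as follows. Python callables stand in for native instructions, the state dictionary stands in for cog registers, and `lmm_jump_if_nz` is a hypothetical stand-in for a kernel jump primitive; none of this is the actual P1 kernel:

```python
def run_lmm(hub, state, max_steps=1000):
    """Fetch each long from hub and execute it unchanged.

    The loop only fetches, advances the hub pc, and runs the instruction.
    It never decodes or dispatches on the instruction's contents.
    """
    steps = 0
    while 0 <= state["pc"] < len(hub) * 4 and steps < max_steps:
        instr = hub[state["pc"] // 4]   # like: rdlong instr, pc
        state["pc"] += 4                # like: add pc, #4
        instr(state)                    # runs as a native instruction would
        steps += 1
    return state


def add_imm(reg, value):
    def instr(state):
        state[reg] += value
    return instr


def sub_imm_wz(reg, value):
    def instr(state):
        state[reg] -= value
        state["z"] = state[reg] == 0    # wz: update the Z flag
    return instr


def lmm_jump_if_nz(target):
    """Stand-in for a kernel primitive: a hub jump rewrites the hub pc."""
    def instr(state):
        if not state["z"]:
            state["pc"] = target
    return instr


if __name__ == "__main__":
    program = [add_imm("total", 5), sub_imm_wz("count", 1), lmm_jump_if_nz(0)]
    final = run_lmm(program, {"pc": 0, "z": False, "total": 0, "count": 3})
    print(final["total"], final["count"])
```

Ordinary instructions pass through the loop untouched, and only control flow across hub addresses needs a helper. Hub execution moves both the fetch and the jumps into hardware.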

jazzed wrote: »

Yes, I'm thrilled that we will no longer need the LMM interpreter hack. And we won't have to answer embarrassing questions like "Interpreted? Well, why didn't you fix that?"

We won't have to waste memory on an interpreter anymore. We can still use an interpreter with Eric's compressed instruction set or Chip's spin byte-code for higher density.

I just hope the performance boost we are expecting is achievable.

David Betz · 2013-12-09 09:24

Bill Henning wrote: »

LMM is a hack (in the non-derogatory meaning of the word), if a cute one, that was needed to make large model programs possible.

That's a very good way to put it. LMM is a hack in the very best sense of the word, a clever way to get around a difficult problem!

potatohead · 2013-12-09 09:25

I think LMM is best described as a virtual machine. It's a simple one: a hybrid of native execution and interpretation, depending on what the LMM kernel actually does with program flow and/or software-defined instructions.

Awesome hack too. I still remember that first post Bill. Read it, thought for a moment, then "oh yeah! sweet! Big programs, here we come!"

jazzed · 2013-12-09 09:27

LOL.

Call it what you will Bill, the LMM kernel has to interpret jumps, etc.... I'm not trying to offend you.

Bill Henning · 2013-12-09 09:28

David,

of course

' in generated code

HCALL #my_fcn

... much code later ...

my_fcn
... more code
HRET

would work in small/medium cases, and in large cases

my_fcn
popa LR
HJMP #LR ' would also work

However,

if you will recall our earlier conversation, HCALL / HRET ... HCALLA / HRETA ... HCALLB / HRETB addresses your concerns

with the advantage of not having to do

my_fcn
PUSHA LR ' or even slower WRLONG LR, --SP
HCALL #fn2
POPA LR ' or even slower RDLONG LR,++SP
HRET LR

in large programs that do not need a large stack.

Simplest way of keeping everyone happy is to have both versions, so the appropriate version can be used for the appropriate code.

I am not opposed to having the LR variants, so I don't see why you are constantly opposed to the stack variants.

If you think it will save logic, I could use the same argument to say that using only the cog stack variant would also save logic.

David Betz wrote: »
I know this is controversial but I believe that a hub stack will be required for all but the tiniest C programs (what we currently call COG mode programs). It isn't only a feature to be used by "larger programs". This is because, as has been stated many times here, the AUX stack doesn't really work if you need to take the address of local variables which is often required even in small C programs. This is where a CALL/RET pair that uses a register to hold the return address rather than the AUX stack would be helpful. It isn't essential of course since we could use the CALLA/CALLB version of the instruction (LCALLA or HCALLA or whatever) and then just pop the return address off the AUX stack and store it on the hub stack. It would be more efficient though to be able to just call a function and have the return address stored in a register if that ends up being possible:
    ' in setup code
    SETLR lr

    ' later in generated code
    LCALL_LR #my_fcn

     ' more code

my_fcn
    ' function body
    LRET_LR
The SETLR instruction would set the register to use as the link register. The LCALL_LR instruction would store the return address in that register and the LRET_LR instruction would return to the address stored in the link register. I'm not attached to these names by the way. In fact, I hate them and would welcome better suggestions! :-)

This maps directly to the way PropGCC currently handles calling functions.
