Hub Execution Model Thread (split from blog)

David Betz · 2013-12-07 16:44

rogloh wrote: »

Just wanted to mention something from a bit earlier, when there were discussions going on about the number of tasks that could run in hub exec mode on a single COG. Even if you have 4 different PCs (one per task) you would still need 4 stack pointers as well, so that when the task context switched you would get your correct SP. That is beginning to get trickier to manage, especially if some software uses PTRA, some PTRB, SPA, SPB etc. for its stack. Limiting to a single hub exec task per COG seems okay to me, however it would be nice to be able to mix it with normal COG tasks. You could even then write your own fine grained scheduler as one special COG task that runs say every 16 cycles and can choose to switch out the hub exec PC and SP when some elapsed time has passed. That approach could also effectively allow multiple hub exec tasks per COG.

Good point about needing multiple stack pointers.

As to the self modifying code aspect in hub exec mode with MOVS, MOVI and MOVD etc I would have to suspect it is unlikely that high level language compilers like GCC for example would ever try to do anything like that on instruction code they generated. I am talking modifying of code, not data. But David may know more about that. However a user might try it manually so it is certainly important to recognize/document this limitation.

Right. The GCC code generator isn't likely to do this. In fact, it couldn't really do this in LMM, CMM, or XMM modes even on P1.

David Betz · 2013-12-07 16:46

cgracey wrote: »

Oh, boy! That didn't occur to me. Maybe I should leave the PC's at 9 bits and have a special one-off 16-bit PC that can be assigned to whatever task wants to execute from the hub. That way, the 4 instruction cache lines will always work well, and not get spread too thin. I think that's what I'll do.

Thanks for thinking about these things, Guys.

Ummm... How will you handle transitions between hub mode and COG mode if you don't have a single PC that can address either hub or COG memory? Or are you saying that the 16 bit PC will overlay the 9 bit one on whatever task first enters hub mode?

jazzed · 2013-12-07 16:51

Will the PC be visible to the program?

We have used add/sub PC for short relative branches.

jmg · 2013-12-07 16:54

jazzed wrote: »

Will the PC be visible to the program?
We have used add/sub PC for short relative branches.

Sounds like a good idea.

David Betz · 2013-12-07 17:00

jazzed wrote: »

Will the PC be visible to the program?

We have used add/sub PC for short relative branches.

Good point. The current LMM code generator certainly uses these relative branches. I guess those could be replaced by the relative instructions that have been proposed for hub mode.

cgracey · 2013-12-07 17:32

I think it's going to have to work by simply switching from the 9-bit cog PC to the 16-bit hub PC for whatever task runs in the hub. That's all I can figure at the moment. And, yes, there will need to be some way to do math on that 16-bit PC.

rogloh · 2013-12-07 17:42

That is where using PTRA for the hub PC was nice, as it would let you use getptra, setptra instructions etc, and also do add/sub operations on it. However if you modified it, that would need to be detected to potentially block and trigger a new instruction prefetch operation.

Bill Henning · 2013-12-07 21:41

sub counter,#1 wz
if_nz hjmp #bigaddress

MUCH simpler than having djnz prefixed, easier for gcc to handle

David Betz wrote: »

I'd recommend leaving DJNZ and friends alone and letting them address the larger hub memory by using the BIG instruction. Narrow range relative branches may be difficult to use by the GCC code generator.

Bill Henning · 2013-12-07 21:42

HJMP/HCALL/HCALLA/HCALLB use an embedded 16 bit constant (plus two implied low zero bits) to reach the full 256KB hub address space with one long - see post #2

using two longs would waste a lot of memory and cycles.
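
To make the arithmetic concrete, here is a minimal Python sketch of how a 16-bit long-address field with two implied low zero bits covers 256KB. The helper names are hypothetical and the sketch only models the address field, not the actual HJMP/HCALL instruction encoding.

```python
ADDR_BITS = 16                       # width of the embedded constant, per the post above
HUB_BYTES = (1 << ADDR_BITS) * 4     # each field value names one 4-byte long

def encode_target(byte_addr):
    """Return the 16-bit field for a long-aligned hub byte address (hypothetical helper)."""
    if byte_addr % 4 != 0:
        raise ValueError("target must be long-aligned")
    if not 0 <= byte_addr < HUB_BYTES:
        raise ValueError("target outside the 256KB hub range")
    return byte_addr >> 2            # drop the two implied zero bits

def decode_target(field):
    """Rebuild the byte address by appending the two implied zero bits."""
    return (field & 0xFFFF) << 2
```

The highest reachable long sits at byte address 0x3FFFC, which encodes as 0xFFFF. Anything above that would need a wider field or a second long.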

ctwardell wrote: »

How much of the instruction set would really need changed? CALL and JMP obviously. I assume hub mode in general will be somewhat 'LMM like' in that some portion of the COG will be treated as registers and those will usually be the source and destination for most commands.

I haven't read all of the concepts on the hub execution and won't pretend to fully understand them yet, but what if there was BIGA for absolute and BIGR for relative. Whichever one was used directly before an instruction would determine if the value was treated as absolute or relative.

C.W.

Bill Henning · 2013-12-07 21:45

Why do it that way?

The HJMP in #1 and #2 supports full conditional execution, and could use C and Z to encode relative jumps.

Absolutely zero need for a three-long relative jump around a long jump.

BIG DJNZ would take two longs, two cycles.

SUB counter,#1 wz
if_nz HJMP #bigaddress

two longs, two cycles - and simpler

David Betz wrote: »

A relative branch around a BIG jmp will take three longs. The BIG DJNZ will only take two longs.

Bill Henning · 2013-12-07 21:47

Guys?

relative cog jumps - I like the concept - but that is deep restructuring, perhaps too deep for right now?

And no need for it.

David Betz wrote: »

jmg wrote: »

It seems a little risky making an Opcode flip how it behaves, based on a RAM bit?

I agree completely! I guess the COG mode DJNZ could be relative as well. It would make it harder to use with MOVS though since you'd have to compute the relative address rather than just stuffing in the value of a label.

Bill Henning · 2013-12-07 21:49

Agreed.

No need for it, as

sub counter,#1 wz
if_nz hjmp #address

is clear, and avoids messes.

For p3, instruction set encoding re-designed, maybe 64 bit... who knows?

Cluso99 wrote: »

David Betz wrote: »

Yes, I agree it will be confusing. Perhaps my explanation was wanting - I would rather DJNZ not be relative in Hubexec mode either (keep it consistent).

Bill Henning · 2013-12-07 21:50

I like this better. Simpler.

Frankly, I'd be perfectly happy if a hubexec cog could not have other tasks running in it. If you can fit 1-3 other cog tasks - great. If not easy, forget it for P2.

cgracey wrote: »

Oh, boy! That didn't occur to me. Maybe I should leave the PC's at 9 bits and have a special one-off 16-bit PC that can be assigned to whatever task wants to execute from the hub. That way, the 4 instruction cache lines will always work well, and not get spread too thin. I think that's what I'll do.

Thanks for thinking about these things, Guys.

Bill Henning · 2013-12-07 21:51

I wish everything was visible

jmg wrote: »

Sounds like a good idea.

Bill Henning · 2013-12-07 21:52

Sorry for the flurry of posts guys!

I was trying to catch up after spending the day with family.

David Betz · 2013-12-08 03:16

Bill Henning wrote: »

sub counter,#1 wz
if_nz hjmp #bigaddress

MUCH simpler than having djnz prefixed, easier for gcc to handle

How is that easier than a BIG version of DJNZ? Also, it clobbers Z. Also, please stop saying what you think GCC can handle and what it cannot handle. As far as I know you haven't been involved in any of the GCC code generator work.

David Betz · 2013-12-08 03:18

Bill Henning wrote: »

HJMP/HCALL/HCALLA/HCALLB use an embedded 16 bit constant (plus two implied low zero bits) to reach the full 256KB hub address space with one long - see post #2

using two longs would waste a lot of memory and cycles.

What happens when we have more than 256k of hub memory in some future chip? We change the instruction set again? Shouldn't a CPU architecture be designed to be useful for more than one generation? This is one thing that has bothered me about the Propeller from the start. The very first version of the chip was maxed out with respect to the amount of COG memory it can support.

evanh · 2013-12-08 03:27

David Betz wrote: »

The very first version of the chip was maxed out with respect to the amount of COG memory it can support.

I don't think comparing the size of the general register set to a memory model is apples with apples. Or did you mean amount of Hub-RAM supported?

David Betz · 2013-12-08 03:44

evanh wrote: »

I don't think comparing the size of the general register set to a memory model is apples with apples. Or did you mean amount of Hub-RAM supported?

I'm talking about the size of the register set because in the case of the COG that is also the size of program memory.

David Betz · 2013-12-08 03:56

cgracey wrote: »

Oh, boy! That didn't occur to me. Maybe I should leave the PC's at 9 bits and have a special one-off 16-bit PC that can be assigned to whatever task wants to execute from the hub. That way, the 4 instruction cache lines will always work well, and not get spread too thin. I think that's what I'll do.

Thanks for thinking about these things, Guys.

I was thinking about this and while it's true you only have two of the fancy new PTRA/PTRB registers, you also have the register remapping that you added to support tasking so you could put the stack and frame pointers in regular registers (like is done in PropGCC now for example). This could allow you to run four hub mode tasks. Of course you still have the problem of cache thrashing if you don't have a separate cache per task and you don't get full advantage of the new instruction set by bypassing the use of the new PTR registers but it should work and would allow the same code to be run from multiple hub mode threads without register conflicts.
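
As a rough illustration of the idea, the following Python sketch models per-task register remapping so that each task sees its own stack pointer in an ordinary register. The bank size, bank base, and the remapping rule itself are assumptions made for this sketch; they are not taken from the actual P2 remapping hardware.

```python
TASKS = 4
BANK_SIZE = 16       # assumed number of remapped registers per task
BANK_BASE = 0x100    # assumed start of the remapped window
SP = BANK_BASE       # hypothetical choice: first remapped register holds the stack pointer

regs = [0] * 512     # cog register file

def physical(task, logical):
    """Map a logical register to a physical one; only the window is remapped."""
    if BANK_BASE <= logical < BANK_BASE + BANK_SIZE:
        return logical + task * BANK_SIZE
    return logical

def write(task, logical, value):
    regs[physical(task, logical)] = value

def read(task, logical):
    return regs[physical(task, logical)]

# Each task sets "the same" SP register but lands in its own bank.
for task in range(TASKS):
    write(task, SP, 0x8000 + task * 0x400)
```

Because every task addresses SP by the same logical number, the same code can run in all four tasks without their stacks colliding, which is the point made above.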

evanh · 2013-12-08 04:16

David Betz wrote: »

I'm talking about the size of the register set because in the case of the COG that is also the size of program memory.

Heh, the one part that hasn't changed between P1 and P2 ... It's one of those trade-offs for multi-core designers, local vs global resources. Wasn't easy to conceptualise instruction fetching from Hub.

Hmm, does this mean that, in HEM mode, the first time slot of the pipeline is always bypassing the dedicated port for Cog-RAM instruction fetch?

evanh · 2013-12-08 04:20

David Betz wrote: »

... This could allow you to run four hub mode tasks. Of course you still have the problem of cache thrashing if you don't have a separate cache per task and you don't get full advantage of the new instruction set by bypassing the use of the new PTR registers but it should work and would allow the same code to be run from multiple hub mode threads without register conflicts.

The main point of the four hardware threads like that was for tight low level code. I don't see a need for it at the kernel tasking level - which can implement multitasking in a more relaxed manner.

David Betz · 2013-12-08 04:28

evanh wrote: »

The main point of the four hardware threads like that was for tight low level code. I don't see a need for it at the kernel tasking level - which can implement multitasking in a more relaxed manner.

I wasn't talking about the kernel. We don't even have a kernel do we? :-)

I was talking about Chip's original proposal to extend all four PCs to 16 bits. I'm just saying that that could work if you use the register mapping feature and avoid PTRA/PTRB as stack pointers.

evanh · 2013-12-08 04:33

I was being a tad liberal with the term kernel - viewing the HEM as the core of a nominally kernel driven environment, even if it's just built into the language.

rogloh · 2013-12-08 05:59

David Betz wrote: »

... I'm just saying that that could work if you use the register mapping feature and avoid PTRA/PTRB as stack pointers.

The key thing here is that having PTRA/PTRB/SPA/SPB for a stack pointer already allows indexed addressing which could work out rather well for reading variables on stack frames. It would be nice to be able to take advantage of that feature if possible, otherwise you may need to compute local variable addresses for each local access relative to some COG register variable per task which would not be good for performance.

David Betz · 2013-12-08 06:10

rogloh wrote: »

The key thing here is that having PTRA/PTRB/SPA/SPB for a stack pointer already allows indexed addressing which could work out rather well for reading variables on stack frames. It would be nice to be able to take advantage of that feature if possible, otherwise you may need to compute local variable addresses for each local access relative to some COG register variable per task which would not be good for performance.

That is certainly true. I was just pointing out that it would be *possible* to have four hub tasks. However, the mechanism for doing that wouldn't be ideal. I guess it's a moot point though since the cache thrashing that would result would probably not be worth it.
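
The thrashing concern can be sketched with a small Python model. It assumes 8 longs per cache line, LRU replacement, and round-robin task switching on every instruction; only the count of 4 cache lines comes from the thread, and the function name is hypothetical.

```python
from collections import OrderedDict

CACHE_LINES = 4     # from the thread
LINE_LONGS = 8      # assumption: longs per cache line

def count_misses(task_bases, loop_longs, steps):
    """Round-robin the tasks, one instruction each per step; return line fetches."""
    cache = OrderedDict()           # line number -> None, oldest first
    misses = 0
    for step in range(steps):
        for base in task_bases:
            # Each task loops over loop_longs longs starting at its base (long address).
            line = (base + step % loop_longs) // LINE_LONGS
            if line in cache:
                cache.move_to_end(line)          # mark as most recently used
            else:
                misses += 1
                cache[line] = None
                if len(cache) > CACHE_LINES:
                    cache.popitem(last=False)    # evict least recently used line

    return misses

single = count_misses([0], 16, 160)
four = count_misses([0, 256, 512, 768], 16, 160)
```

In this model a single task looping over 16 longs fetches only its two lines once, while four tasks running the same kind of loop refetch a line at every line boundary, because eight lines are competing for four slots.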

Bill Henning · 2013-12-08 07:12

David Betz wrote: »

How is that easier than a BIG version of DJNZ?

No need for the verilog for DJNZ to support BIG. Easier.

David Betz wrote: »

Also, it clobbers Z.

Irrelevant, Z cannot matter to code outside of the loop, for some expression inside it.

David Betz wrote: »

Also, please stop saying what you think GCC can handle and what it cannot handle. As far as I know you haven't been involved in any of the GCC code generator work.

1) You say many things I don't like, I say things you don't like. Discussion ensues, often better results are obtained due to discussion.

2) Every time I suggest an optimization, I always get a response that boils down to "that is too difficult to support in gcc", "it would be too much work", "it does not fit the gcc world view"

3) Saying that the two instruction sequence is easier for GCC to handle is valid.

Regarding GCC code generator work. I've built quite a few code generators, therefore it is valid for me to comment on how they work in general, because gcc's must work in a similar manner. Something that was verified when I read the code generator Eric built.

If gcc cannot do what other code generators can, that is an issue with gcc's code generator.

Bill Henning · 2013-12-08 07:20

Then we have a new encoding. Once we have more hub, it makes far more sense to move the address to a subsequent long, and not have to worry about assembling addresses from a BIG and a small constant.

Yes, we are guaranteed that there will be some instruction set changes every time the core is rev'ed.

The issue is that we are bolting on large addresses to an instruction set that was designed for small addresses. For P3, it would be best to do a "clean sheet" re-design of instruction encoding - we do not have the luxury of time for that; also with a 256KB hub, it would be idiotic to have two long (eight byte) jump/call instructions whenever it can be avoided as it would waste too much of the precious limited hub memory.

Basically, you don't get the Propeller 1's architecture then (or at least don't agree with Chip's choice in how to design it) but that is history - it was not designed for larger programs, which is why LMM engine (which is not an interpreter) was necessary to run them.

David Betz wrote: »

What happens when we have more than 256k of hub memory in some future chip? We change the instruction set again? Shouldn't a CPU architecture be designed to be useful for more than one generation? This is one thing that has bothered me about the Propeller from the start. The very first version of the chip was maxed out with respect to the amount of COG memory it can support.

ctwardell · 2013-12-08 07:44

Bill Henning wrote: »

HJMP/HCALL/HCALLA/HCALLB use an embedded 16 bit constant (plus two implied low zero bits) to reach the full 256KB hub address space with one long - see post #2

using two longs would waste a lot of memory and cycles.

I see now that you have already provided for relative JMP and CALL, that's good.

The comments in 374/376 made me think everything was going to be absolute.

C.W.

David Betz · 2013-12-08 08:15

Bill Henning wrote: »

No need for the verilog for DJNZ to support BIG. Easier.

My understanding is that BIG will extend any S field so making it not work with DJNZ would require a special case. Working with DJNZ is the default.

Irrelevant, Z cannot matter to code outside of the loop, for some expression inside it.

That is probably true. I was just pointing out a difference between the two that could matter. Perhaps it never will.
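
A minimal Python sketch of that difference, assuming DJNZ is used without a WZ effect (so it leaves Z alone) and modelling only the counter and the Z flag. The function names and the dictionary-based state are illustrative only.

```python
MASK32 = 0xFFFFFFFF

def djnz(state, reg):
    """Model DJNZ without WZ: decrement, branch if nonzero, Z untouched."""
    state[reg] = (state[reg] - 1) & MASK32
    return state[reg] != 0

def sub_wz_if_nz_jump(state, reg):
    """Model 'sub reg,#1 wz' followed by a jump conditional on if_nz."""
    state[reg] = (state[reg] - 1) & MASK32
    state["Z"] = state[reg] == 0      # the WZ effect overwrites Z
    return not state["Z"]

# Same starting state for both sequences, with Z set by some earlier code.
a = {"counter": 3, "Z": True}
b = {"counter": 3, "Z": True}
taken_a = djnz(a, "counter")
taken_b = sub_wz_if_nz_jump(b, "counter")
```

Both sequences branch identically here; only the second one changes Z, which matters only if code after the loop test relies on a Z value set before it.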

1) You say many things I don't like, I say things you don't like. Discussion ensues, often better results are obtained due to discussion

2) Every time I suggest an optimization, I always get a response that boils down to "that is too difficult to support in gcc", "it would be too much work", "it does not fit the gcc world view"

3) Saying that the two instruction sequence is easier for GCC to handle is valid.

Regarding GCC code generator work. I've built quite a few code generators, therefore it is valid for me to comment on how they work in general, because gcc's must work in a similar manner. Something that was verified when I read the code generator Eric built.

If gcc cannot do what other code generators can, that is an issue with gcc's code generator.

I don't object to you making comments about what would be easy or hard for you to implement in a code generator. I object to specific references to GCC since I am not aware that you have any experience with it. If you notice, I don't usually make definitive statements about how hard things would be with GCC but usually defer to Eric who has done most of the code generator work. I sometimes speculate as to what I think might be easy or hard but I don't feel like I'm in a position to make definite statements and I have worked in the PropGCC code generator at least a little. Like you, I've also worked on numerous compilers over the years so I have at least a basic understanding of how they work.
