Hub Execution Model Thread (split from blog)

potatohead · 2013-12-11 08:13

Are we talking about starting a cog in HUBEX mode given only a HUB address? If so, OK, where does that lead?

If we are talking about having COGS run from HUB memory all the time, I think that's a mistake. The separate COG memory space has some serious advantages when COGS are in COGEX mode.

@JMG

Even with a changing core
a) Code written in a HLL, will gain automatically as compilers improve
b) Hand crafted library code gives a good starting point, for further tuning, so is never really wasted.

Well, first off, the HLL isn't available at the moment, so let's put that aside, apart from some basic SPIN 2 statements.

As to b, I've got some basic code written for various things, and you are right about that. I bristle a little about not changing something because some code may be invalidated. If the code really makes sense, that's a good check, but the work that went into it isn't necessarily the same check, that's all.

Bill Henning · 2013-12-11 08:15

Makes sense to keep the pipeline as short as possible, and with the addition of relative instructions, there is no further significant benefit to exposing the PC (and we avoid the additional pipeline stage!).

cgracey wrote: »

I spent all day getting the PC mapped into $1F0/$1F1, only to realize that it extended the critical path of the whole chip and slowed it way down. The reason is that the computed ALU result is the last-arriving signal set, and to run it through a few more sets of mux's to accommodate the four task PC's, and then get it out to the cog RAM instruction address input, just takes too long. The only way to circumvent these delays is to add another pipeline stage, which will make cancelling branches take one more clock, and 4-way multitasking branches take two clocks, instead of one. It's not worth it. So, the PCs will have to be addressed by instructions, only, in which the PC result does not go through the main ALU. It was worth trying, though, because the benefits would have been great. I think to compensate, I'll make relative jumps, which are easy to implement without drawbacks. This will give us the same performance we would have had with mappable PC's, when it comes to adding to them.

Bill Henning · 2013-12-11 08:20

That sounds really, really good - I can't wait to see the docs and try it!

cgracey wrote: »

I've got the master plan all worked out now for hub execution. It's taken a few days of thinking to get it all sorted out, but it turned out to be pretty simple, and the programmer will not be burdened with lots of strange branch instructions. Code that runs in the hub is written with the same branch instructions as code that runs in the cog. Each branch instruction has a special version which toggles between hub/cog mode for the branch destination. All stack saves incorporate this mode bit. RETurns restore the caller's hub/cog mode. So, you can call from hub to cog and vice-versa with the same set of instructions. It's really easy to use. When in hub mode, all those DJNZ/JP/etc. branches become relative, if immediate. The assembler will know by a directive whether it is assembling hub or cog code. The context between hub and cog modes is very fluid. I think everyone's going to be very happy.

All task PC's have been expanded from 9 to 16 bits, to accommodate hub addressing. It was natural to do it this way, since the PC's have all the pipeline mechanisms built into them. I even added a 4-bit field to JMPTASK's PC mask which will allow you to launch hub threads directly, without even starting them in the cog. It's going to be sheer simplicity.
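
As a rough illustration of the call/return behaviour described above, here is a minimal Python model. It assumes only what the post states: some branches toggle hub/cog mode, calls save the mode bit with the return address, and RET restores the caller's mode. The class and method names are invented for the sketch, and the "return to PC + 1" step is an assumption about addressing, not something the post specifies.

```python
# Hypothetical model of the described call/return behaviour; names are invented.
COG, HUB = 0, 1
PC_MASK = 0xFFFF  # task PCs are described as 16 bits wide


class Task:
    def __init__(self, pc=0, mode=COG):
        self.pc = pc & PC_MASK
        self.mode = mode
        self.stack = []  # each entry keeps (return PC, caller's mode)

    def call(self, target, toggle=False):
        # Save the return address together with the current mode bit.
        # Assumption: the return address is simply the next PC value.
        self.stack.append(((self.pc + 1) & PC_MASK, self.mode))
        if toggle:  # the "special version" of the branch flips hub/cog mode
            self.mode ^= 1
        self.pc = target & PC_MASK

    def ret(self):
        # Returning restores both the PC and the caller's mode.
        self.pc, self.mode = self.stack.pop()


t = Task(pc=0x010, mode=COG)
t.call(0x4000, toggle=True)   # cog code calls a hub routine
in_hub = (t.pc, t.mode)
t.ret()                       # back in cog mode at the following instruction
back = (t.pc, t.mode)
```

The point of the model is only that the caller never has to know which mode the callee runs in, because the saved mode bit travels with the return address.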

evanh · 2013-12-11 13:24

potatohead wrote: »

If we are talking about having COGS run from HUB memory all the time, I think that's a mistake. The separate COG memory space has some serious advantages when COGS are in COGEX mode.

Absolutely. As I understand it, the performance of Cog-exec is still notably better than Hub-exec, particularly when data has to flow to/from Hub.

The idea of directly starting a Cog in hubexec mode doesn't remove the ability to flip between execution modes. For me, it was purely an idea to allow instant (re)starting of a Cog without the need to preload it. And as Seairth has pointed out (My coding experience level on the Prop is basically nil), it's not hard to still block fill a Cog then execute that natively.

Cluso99 · 2013-12-11 16:46

Starting the cog in hubexec mode...

What might be more useful is being able to start the cog with a minimal cog ram load. Something like giving it the hub address with a length, in addition to the "PAR" value. In addition to this, the ability to "reset" the cog or "not reset" the cog could be useful.

Think of having the cog be ready for doing some short intensive tasks and then going back to idle. Here you want it to be fast, and a full cog load is likely not desired.

It might also be useful to have the hubaddr and par values that are put into PTRA & PTRB be 32 bits. I have suggested this before, but at the time there was no time. It would be quite useful to be able to pass 32 bits of data via "PAR". It would also be useful to pass 32 bits via the HubAdr (by setting upper bits which would be ignored by coginit/cognew).

Cluso99 · 2013-12-11 16:57

re D & S pointing to a contiguous address space
$000-1FF = cog
$200-3FF = aux
$400-3FFFF = hub

Yes, I know this skips the lowest 4KB of hub ROM (and little bit of ram). But it does become extremely useful.
All instructions could be used to access cog/aux/hub - think AND/XOR/ROL etc. Yes, there would be additional clocks inserted for hub access, and perhaps aux access. For the benefits of being able to utilise aux and hub as variable space, this would be a great addition.

I have no idea if this is
(a) feasible
(b) simple or complex

Software to utilise this would be easy
(a) use a cog pointer eg XOR dptr, sptr
(b) Use immediate D and/or S, and prefix the instruction using AUGS or AUGD (the old BIG instruction)

Access to the hub $000-3FF could be achieved by using $40000-403FF

I am just throwing this out there for Chip to think about, in context with the other mechanisms.
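
To make the proposed map concrete, here is a small Python sketch of how an address might be decoded under it. This only restates the ranges listed above; the function name and the choice to return an offset within aux RAM are assumptions, and it glosses over the difference between long-addressed cog/aux and byte-addressed hub.

```python
# Hypothetical decoder for the proposed unified D/S address space.
def decode(addr):
    """Return (memory, offset) for an address in the proposed map."""
    if 0x000 <= addr <= 0x1FF:
        return ("cog", addr)
    if 0x200 <= addr <= 0x3FF:
        return ("aux", addr - 0x200)
    if 0x400 <= addr <= 0x3FFFF:
        return ("hub", addr)
    if 0x40000 <= addr <= 0x403FF:
        # Alias window that reaches the hub addresses hidden by cog/aux.
        return ("hub", addr - 0x40000)
    raise ValueError("address outside the proposed map")
```

For example, `decode(0x40010)` would reach hub address `$010`, which the plain map cannot express because `$010` already means a cog register.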

jmg · 2013-12-11 17:07

Cluso99 wrote: »

re D & S pointing to a contiguous address space
$000-1FF = cog
$200-3FF = aux
$400-3FFFF = hub

Yes, I know this skips the lowest 4KB of hub ROM (and little bit of ram). But it does become extremely useful.

This becomes like many micros, which can overlay Register memory, with longer opcode (DATA) memory.
It is therefore feasible, question is, does it have any significant time/silicon costs ?

David Betz · 2013-12-11 17:13

jmg wrote: »

This becomes like many micros, which can overlay Register memory, with longer opcode (DATA) memory.
It is therefore feasible, question is, does it have any significant time/silicon costs ?

There might not be enough ports on the AUX memory to support both source and destination addresses.

jmg · 2013-12-11 17:41

David Betz wrote: »

There might not be enough ports on the AUX memory to support both source and destination addresses.

Good point, I missed that the general case may have been intended.
When small uCs overlay like this, the opcodes available are reduced, usually to just load/store, so I assumed that was implied here too.

Cluso99 · 2013-12-11 18:39

David Betz wrote: »

There might not be enough ports on the AUX memory to support both source and destination addresses.

I am sure there will be ramifications. It may take more clocks to execute an instruction. But as long as there is no impact to the cog mode, and it is simple to do, then it may be worth doing. Only Chip will know for sure.

I asked a little while ago about having aux index registers like INDA/INDB. Chip said it was a big task which I accept.
But thinking laterally, could INDA/INDB be used along the same lines... ie INDA/INDB addresses $200-3FF being aux ram?

David Betz · 2013-12-11 18:43

Cluso99 wrote: »

I am sure there will be ramifications. It may take more clocks to execute an instruction. But as long as there is no impact to the cog mode, and it is simple to do, then it may be worth doing. Only Chip will know for sure.

That is certainly true. The rest of us can only guess. At least that's true for me anyway. Some day we may get a chance to see the COG Verilog code and then maybe we'll understand the magic that Chip has been doing a little better.

rogloh · 2013-12-11 19:00

Cluso99 wrote: »

re D & S pointing to a contiguous address space
$000-1FF = cog
$200-3FF = aux
$400-3FFFF = hub

Yes, I know this skips the lowest 4KB of hub ROM (and little bit of ram). But it does become extremely useful.
All instructions could be used to access cog/aux/hub - think AND/XOR/ROL etc. Yes, there would be additional clocks inserted for hub access, and perhaps aux access. For the benefits of being able to utilise aux and hub as variable space, this would be a great addition.

I have no idea if this is
(a) feasible
(b) simple or complex

Software to utilise this would be easy
(a) use a cog pointer eg XOR dptr, sptr
(b) Use immediate D and/or S, and prefix the instruction using AUGS or AUGD (the old BIG instruction)

Access to the hub $000-3FF could be achieved by using $40000-403FF

I am just throwing this out there for Chip to think about, in context with the other mechanisms.

While it could be potentially useful for a larger data space, that appears to me at least to be a major change for the P2 register addressing architecture.

What might be simpler and useful is some additional variant of RDLONG and WRLONG that could detect low addresses and access the 768 longs in COG/AUX RAM if the long address is below $300 (or $400), or hub RAM otherwise. This would give a nice way to dereference pointers and access stack variables independent of whether they are in hub memory or internal COG memory(s). David might like this feature for GCC too. Or you might be able to make RDLONG/WRLONGs work that way all the time when in hub mode and not need another variant.

Due to the 32-bit nature of the internal COG memories, accesses might have to be limited to 32 bits (or perhaps smaller), instead of quads/octs etc., but given you can only index and push/pop 32-bit values into the AUX RAM today, it may not be that big of a deal.
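
A minimal sketch of that idea in Python, assuming the $300 threshold (512 cog longs plus 256 aux longs). The class and its methods are hypothetical stand-ins, not real instructions, and hub alignment is ignored.

```python
# Hypothetical model of a pointer-agnostic long read/write; not a real instruction.
INTERNAL_LONGS = 0x300  # 512 cog longs + 256 aux longs = 768, per the post


class Memory:
    def __init__(self):
        self.internal = [0] * INTERNAL_LONGS  # cog RAM followed by aux RAM
        self.hub = {}                         # sparse stand-in for hub RAM

    def rdlong(self, addr):
        # Low addresses are served from internal RAM, everything else from hub.
        if addr < INTERNAL_LONGS:
            return self.internal[addr]
        return self.hub.get(addr, 0)

    def wrlong(self, addr, value):
        value &= 0xFFFFFFFF  # accesses are limited to 32 bits
        if addr < INTERNAL_LONGS:
            self.internal[addr] = value
        else:
            self.hub[addr] = value
```

Code that dereferences a pointer through `rdlong`/`wrlong` then does not need to know whether the variable lives in cog, aux or hub memory.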

Cluso99 · 2013-12-11 19:12

David Betz wrote: »

That is certainly true. The rest of us can only guess. At least that's true for me anyway. Some day we may get a chance to see the COG Verilog code and then maybe we'll understand the magic that Chip has been doing a little better.

I am unsure how much I will ever understand, but it would sure be nice to try. However, I am concerned that he doesn't release the whole pudding until the IP thing is thoroughly understood. Would hate to see it taken by a company with far more $.

David Betz · 2013-12-11 19:24

Cluso99 wrote: »

I am unsure how much I will ever understand, but it would sure be nice to try. However, I am concerned that he doesn't release the whole pudding until the IP thing is thoroughly understood. Would hate to see it taken by a company with far more $.

I agree!

cgracey · 2013-12-11 21:37

David Betz wrote: »

That is certainly true. The rest of us can only guess. At least that's true for me anyway. Some day we may get a chance to see the COG Verilog code and then maybe we'll understand the magic that Chip has been doing a little better.

The main cog Verilog file is 4,500 lines long and doesn't include the hardware peripherals. Right now, the whole thing is spooled up in my head. If I were to go away for a few weeks, it would take another month to get back to where I am at the moment. I can make changes quickly now because I've been immersed in it very intensely for several months and I know all its details. When I get back to the silicon side, I will lose this momentum, and when I return again, it will take a long time to become as proficient as I am right now. This is the time to improve the Verilog code. There may not be another opportunity for a long time.

David Betz · 2013-12-12 04:55

cgracey wrote: »

The main cog Verilog file is 4,500 lines long and doesn't include the hardware peripherals. Right now, the whole thing is spooled up in my head. If I were to go away for a few weeks, it would take another month to get back to where I am at the moment. I can make changes quickly now because I've been immersed in it very intensely for several months and I know all its details. When I get back to the silicon side, I will lose this momentum, and when I return again, it will take a long time to become as proficient as I am right now. This is the time to improve the Verilog code. There may not be another opportunity for a long time.

I can understand that completely and I'm sure many others here can as well. We get immersed in a project and have a deep knowledge of all aspects of "the code" whether it be Verilog or C or Spin or assembly language. But when time has passed it takes a long time to get back to that level of familiarity. We're fortunate that you were able to dive back into this and add some very useful features that I'm sure will make the P2 a better product.

Sapieha · 2013-12-13 10:59

Hi Chip.

I have one question.

Will it be possible to execute HUB code as relocatable code, or will it always need a static address?

evanh · 2013-12-13 14:00

Sapieha wrote: »

Will it be possible to execute HUB code as relocatable code, or will it always need a static address?

PC-Relative code will be what's meant.

cgracey · 2013-12-13 18:57

Sapieha wrote: »

Hi Chip.

I have one question.

Will it be possible to execute HUB code as relocatable code, or will it always need a static address?

There will be both relative and absolute immediate jumps and calls.
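
To show why relative branches matter for relocatable code, here is a small Python sketch. It is illustrative only: the function names are made up, and it measures the offset from the branch's own address, which may differ from how the hardware counts it.

```python
# Illustrative only: how relative and absolute branch targets respond to relocation.
def target(pc, operand, relative):
    """Resolve a branch target from the branch's own address and its operand."""
    return pc + operand if relative else operand


def relocated_ok(load_base, new_base, branch_off, dest_off, relative):
    """Assemble a branch for load_base, run it at new_base, check it still hits dest."""
    # The operand is fixed at assembly time, using the original load address.
    operand = (dest_off - branch_off) if relative else (load_base + dest_off)
    landed = target(new_base + branch_off, operand, relative)
    return landed == new_base + dest_off
```

With these definitions, `relocated_ok(0x1000, 0x2000, 4, 20, True)` holds, while the same call with `relative=False` does not: the absolute branch still points into the old location.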

pedward · 2013-12-14 14:24

cgracey wrote: »

The main cog Verilog file is 4,500 lines long and doesn't include the hardware peripherals. Right now, the whole thing is spooled up in my head. If I were to go away for a few weeks, it would take another month to get back to where I am at the moment. I can make changes quickly now because I've been immersed in it very intensely for several months and I know all its details. When I get back to the silicon side, I will lose this momentum, and when I return again, it will take a long time to become as proficient as I am right now. This is the time to improve the Verilog code. There may not be another opportunity for a long time.

Only 4500 lines? That's surprisingly short. I've been responsible for projects that had numerous single source files with > 5000 lines.

My SMTP mail server was something like 14k or 19k LOC, don't remember exactly how much.

The process/config manager I wrote for a monitoring tool was around 750 lines, the most compact code I think I wrote for such a purpose, it was C++.

Higher level languages afford much less LOC, which is both a good and bad thing. Good because there are fewer moving parts in your code, bad because if there is a major problem not in your code, it's someone else's moving parts.

Heater. · 2013-12-14 14:50

I'm inclined to think the less code the better. Or shall we say, it should be as big as it needs to be and no more. Code does not have to be big to be Earth shatteringly brilliant and useful. Think FFT for example.

I was also wondering what to make of that number.

It has often been said that a programmer produces about 10 lines of production code per day on average. Which sounds very small but when you include all the design, testing, documentation, bug hunting and "maintenance" begins to sound about right. Not to mention all the research, exploration and skills acquisition one might have to do up front. In which case that 4500 lines should have taken a tad over two years to perfect.
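
The arithmetic behind that estimate can be checked in a few lines of Python. The working-days-per-year figure is an assumption added here, not something from the thread.

```python
# Back-of-envelope check; the working-days figure is an assumption, not from the thread.
LINES = 4500
LINES_PER_DAY = 10
WORKING_DAYS_PER_YEAR = 220  # assumed: roughly 44 five-day weeks

days = LINES / LINES_PER_DAY            # working days at 10 lines per day
years = days / WORKING_DAYS_PER_YEAR    # comes out a little over two years
```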

Then of course Chip has also had other parts of the P2 design to work on and the updating of language tools etc.

Ultimately that code size is limited by the number of gates you can get on the chip, like normal software size is limited by memory. So factoring functionality and removing code also takes time.

All in all it starts to sound about right. Mind you I really don't have a feel for writing what is highly parallel functionality in an HDL as compared to normal sequential software.

cgracey · 2013-12-14 16:56

Heater. wrote: »

I'm inclined to think the less code the better. Or shall we say, it should be as big as it needs to be and no more. Code does not have to be big to be Earth shatteringly brilliant and useful. Think FFT for example.

I was also wondering what to make of that number.

It has often been said that a programmer produces about 10 lines of production code per day on average. Which sounds very small but when you include all the design, testing, documentation, bug hunting and "maintenance" begins to sound about right. Not to mention all the research, exploration and skills acquisition one might have to do up front. In which case that 4500 lines should have taken a tad over two years to perfect.

Then of course Chip has also had other parts of the P2 design to work on and the updating of language tools etc.

Ultimately that code size is limited by the number of gates you can get on the chip, like normal software size is limited by memory. So factoring functionality and removing code also takes time.

All in all it starts to sound about right. Mind you I really don't have a feel for writing what is highly parallel functionality in an HDL as compared to normal sequential software.

In normal software projects, it seems to go three steps forward, two steps back. In hardware, where there are silicon size issues and ramifications of how it will run software, it can become more like 1 step forward, 0.9 steps back. There is a ton of rip up and redo.

cgracey · 2013-12-16 04:46

I had said earlier that I thought hub execution would be about 80% of the native cog rate.

Pedward and I talked over the weekend, as he had been thinking about caching and hit rates, and I think now that it's going to be lower than 80%, maybe even 50%. The reason is that otherwise-contiguous code is almost always punctuated by calls and branches, which destroy caching efficiency pretty quickly. It almost makes me think that there may not be much value beyond 1 cache line. The plan is to implement 4 lines, which would be necessary to get any kind of performance out of four hub tasks. We'll see soon, hopefully, how well it works.

Bill Henning · 2013-12-16 05:37

With four cache lines there is a decent probability that when returning from a call that there will be useful instructions in the cache; also loops up to four cache lines in length would be much faster.

The hit rate will depend on the type of code being executed. There is a "diminishing returns" factor with caches as well, however four lines will perform better than one.

Without four lines, running all four tasks in hub mode would result in a line reload every instruction, which would be really slow.
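
A toy simulation can illustrate the reload-every-instruction case. The Python sketch below assumes an LRU cache with 8-long lines (the line size quoted later in the thread), round-robin tasks issuing one instruction per turn, and straight-line code with no branches. None of this is claimed to match the actual hardware.

```python
# Toy model: N-line LRU instruction cache shared by round-robin tasks.
from collections import OrderedDict

LINE_LONGS = 8  # assumed line size in instructions (longs)


def hit_rate(num_lines, task_starts, steps_per_task):
    cache = OrderedDict()            # line tag -> None, least recently used first
    pcs = list(task_starts)
    hits = total = 0
    for _ in range(steps_per_task):
        for t in range(len(pcs)):    # one instruction per task per turn
            tag = pcs[t] // LINE_LONGS
            if tag in cache:
                hits += 1
                cache.move_to_end(tag)
            else:
                if len(cache) >= num_lines:
                    cache.popitem(last=False)  # evict least recently used
                cache[tag] = None
            pcs[t] += 1              # straight-line code, no branches
            total += 1
    return hits / total


starts = [0, 1000, 2000, 3000]       # four hypothetical tasks in separate code regions
one_line = hit_rate(1, starts, 64)
four_lines = hit_rate(4, starts, 64)
```

In this idealised setup a single line never hits, while four lines miss only once per line fill. Real code with calls and branches would sit somewhere below that upper bound.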

cgracey wrote: »

I had said earlier that I thought hub execution would be about 80% of the native cog rate.

Pedward and I talked over the weekend, as he had been thinking about caching and hit rates, and I think now that it's going to be lower than 80%, maybe even 50%. The reason is that otherwise-contiguous code is almost always punctuated by calls and branches, which destroy caching efficiency pretty quickly. It almost makes me think that there may not be much value beyond 1 cache line. The plan is to implement 4 lines, which would be necessary to get any kind of performance out of four hub tasks. We'll see soon, hopefully, how well it works.

pedward · 2013-12-16 18:20

I read some of Roy's comments on the GCC code generator and leaf functions.

He said that in his work, he profiled a lot of generated code and found that there were a lot of small, terminal, leaf functions that were small in size, frequently with prologue and epilogue removed.

This was the feeling I had and asserted when talking with Chip. My assertion was that 2 cache lines were necessary per thread, to accommodate the caller and callee (leaf). There isn't enough space to accommodate 2 cache lines per task, so in my mind it makes no sense to support hubex for multi-threading/tasks.

If Chip decides to implement 4 cache lines, with a single thread of execution, that should work fine with an LRU algorithm.

If code doesn't do hubex, I'd like to see those 4 cache lines generally re-purposeable for PASM code, otherwise they are an architectural feature that is only useful to GCC code, and a huge waste of resources.

I suggested that perhaps the COG should be able to do scatter-gather reads from HUB RAM, telling the DMA that it wants to load a number of addresses in an order. If those were mapped like the QUADs are, you could write some nice code that could DMA from HUB RAM, and let the COG do realtime signal processing, etc.

Each cache line is 8 longs, or 32 bytes, times 4 that's 128 bytes. It would be cool to be able to DMA 128-byte chunks for some things.
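
The caller-plus-leaf argument and the size arithmetic can be sketched in Python. This is a toy model under stated assumptions: an LRU policy, a hypothetical caller and leaf that each fit in one line, and addresses counted in longs.

```python
# Toy model of a caller loop that repeatedly calls a small leaf function.
LINE_LONGS = 8
BYTES_PER_LONG = 4
LINE_BYTES = LINE_LONGS * BYTES_PER_LONG      # bytes per cache line
TOTAL_BYTES = 4 * LINE_BYTES                  # bytes in four lines


def misses(trace, num_lines):
    """Count line reloads for a list of instruction addresses (in longs), LRU policy."""
    lines = []                    # most recently used tag is last
    count = 0
    for addr in trace:
        tag = addr // LINE_LONGS
        if tag in lines:
            lines.remove(tag)
        else:
            count += 1
            if len(lines) == num_lines:
                lines.pop(0)      # evict least recently used
        lines.append(tag)
    return count


caller = [0, 1, 2]        # hypothetical caller instructions, all in one line
leaf = [800, 801]         # hypothetical leaf function, in a different line
trace = (caller + leaf) * 10
```

With one line, every call and every return forces a reload. With two lines, only the first touch of each line misses.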

David Betz · 2013-12-16 18:30

pedward wrote: »

I read some of Roy's comments on the GCC code generator and leaf functions.

He said that in his work, he profiled a lot of generated code and found that there were a lot of small, terminal, leaf functions that were small in size, frequently with prologue and epilogue removed.

This was the feeling I had and asserted when talking with Chip. My assertion was that 2 cache lines were necessary per thread, to accommodate the caller and callee (leaf). There isn't enough space to accommodate 2 cache lines per task, so in my mind it makes no sense to support hubex for multi-threading/tasks.

If Chip decides to implement 4 cache lines, with a single thread of execution, that should work fine with an LRU algorithm.

If code doesn't do hubex, I'd like to see those 4 cache lines generally re-purposeable for PASM code, otherwise they are an architectural feature that is only useful to GCC code, and a huge waste of resources.

I suggested that perhaps the COG should be able to do scatter-gather reads from HUB RAM, telling the DMA that it wants to load a number of addresses in an order. If those were mapped like the QUADs are, you could write some nice code that could DMA from HUB RAM, and let the COG do realtime signal processing, etc.

Each cache line is 8 longs, or 32 bytes, times 4 that's 128 bytes. It would be cool to be able to DMA 128-byte chunks for some things.

I'm not entirely sure why we're even talking about multiple GCC tasks in a single COG. It seems to me that we'd need more than an LR register for each task. We'd also need separate copies of the PTRx registers for each task if GCC is to make use of one of PTRA/PTRB as a stack pointer. Also, GCC currently uses 16 pseudo-registers plus a couple of special registers starting at COG address $0 and those would have to be duplicated as well using the register mapping feature. The result would be that half of COG memory would be consumed by four sets of GCC registers. This coupled with the fact that the cache will not perform well if shared among multiple tasks makes me think that there really isn't a need to support multiple C tasks in a single COG.

Cluso99 · 2013-12-16 18:55

David Betz wrote: »

I'm not entirely sure why we're even talking about multiple GCC tasks in a single COG. It seems to me that we'd need more than an LR register for each task. We'd also need separate copies of the PTRx registers for each task if GCC is to make use of one of PTRA/PTRB as a stack pointer. Also, GCC currently uses 16 pseudo-registers plus a couple of special registers starting at COG address $0 and those would have to be duplicated as well using the register mapping feature. The result would be that half of COG memory would be consumed by four sets of GCC registers. This coupled with the fact that the cache will not perform well if shared among multiple tasks makes me think that there really isn't a need to support multiple C tasks in a single COG.

+1
And I think similar restrictions will apply to anyone trying pasm hub code, although I suppose that at least we can restrict usage between hub pasm threads. But overall, I see no requirement for hub pasm to run in more than one task per cog.

jmg · 2013-12-16 19:24

David Betz wrote: »

I'm not entirely sure why we're even talking about multiple GCC tasks in a single COG..

Depends on the semantics ?
I can see that Multiple, Hub-Exec programs in one COG, may cross the cost/benefit line, but I can see a need for GCC compiled blocks, running in COG, but not in Hub-Exec mode, only Hub-Data R/W needed.
Those compact GCC generated modules, need to co-operate with the Hub-Exec module, and allow user control of the boundaries.

One question, is how little of a COG can Hub-Exec mode GCC use, and still be useful ?

David Betz · 2013-12-16 19:36

jmg wrote: »

Depends on the semantics ?
I can see that Multiple, Hub-Exec programs in one COG, may cross the cost/benefit line, but I can see a need for GCC compiled blocks, running in COG, but not in Hub-Exec mode, only Hub-Data R/W needed.
Those compact GCC generated modules, need to co-operate with the Hub-Exec module, and allow user control of the boundaries.

One question, is how little of a COG can Hub-Exec mode GCC use, and still be useful ?

At a minimum each GCC task will need a copy of the GCC pseudo registers. This is currently 16 plus PC, SP, and LR. Now with hub execute mode there isn't need for an LMM PC. If we use PTRA as the stack pointer there is no need for SP. If we get the LR register we've asked Chip for then we don't need that as a pseudo-register. In that case GCC can make do with 16 COG locations to use as pseudo registers. However, that assumes that Chip will provide separate copies of PTRA and LR for each task and I'm not sure that is planned. If not then SP will probably have to be a pseudo register and LR may need to occupy a COG register as well. Since registers can only be mapped in powers of 2 that means that each GCC task would need to reserve 32 COG locations for its pseudo register set. This may not be as bad as it sounds because we could just add additional general pseudo registers and GCC will be able to make use of them in its code generator. The problem is, no GCC code or any runtime code will be able to make use of the PTRx registers at all if they are not duplicated one per task.
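
The sizing works out as below. This Python sketch uses the register counts from the post and a made-up helper name; it only restates the power-of-two rounding.

```python
# Rough sizing of per-task pseudo-register blocks; counts come from the post.
def block_size(registers):
    """Round up to the next power of two, as register mapping is described."""
    size = 1
    while size < registers:
        size *= 2
    return size


GENERAL = 16
best_case = block_size(GENERAL)          # PTRA and LR supplied per task by hardware
worst_case = block_size(GENERAL + 2)     # SP and LR kept as pseudo registers
spare = worst_case - (GENERAL + 2)       # extra slots GCC could use as registers
```

So needing just two more registers doubles the block from 16 to 32 locations, which is why the leftover slots would be handed to the code generator as extra general registers.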

cgracey · 2013-12-16 20:01

David Betz wrote: »

...The problem is, no GCC code or any runtime code will be able to make use of the PTRx registers at all if they are not duplicated one per task.

You don't really need any PTR register if you just maintain a regular RAM register as a stack pointer. I would think that C is not likely to do more than push/pop, anyway, and that is just a WRxxxx/RDxxxx instruction with an ADD/SUB instruction after or before it to update the address. PASM is likely to get lots of use out of the PTRs, though.
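
As a sketch of what that looks like, here is a Python model of a stack pointer kept in an ordinary register. It is not real PASM: the class is hypothetical, the stack is assumed to grow upward in 4-byte steps, and a dictionary stands in for hub RAM.

```python
# Hypothetical model of a stack kept in an ordinary register; not real PASM.
class SoftStack:
    def __init__(self, base):
        self.sp = base        # plays the role of a plain cog register
        self.hub = {}         # sparse stand-in for hub RAM, keyed by byte address

    def push(self, value):
        self.hub[self.sp] = value & 0xFFFFFFFF   # the WRxxxx step
        self.sp += 4                              # ADD afterwards

    def pop(self):
        self.sp -= 4                              # SUB beforehand
        return self.hub[self.sp]                  # the RDxxxx step


s = SoftStack(0x1000)
s.push(1)
s.push(2)
top = s.pop()
```

Each push or pop costs two operations instead of one, which is the trade being weighed against leaving the PTR registers free for PASM tasks.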

Dave, have you thought about not using the PTR registers? I think the benefit to C is very marginal, but they are significant to PASM programs that might be running in other tasks. In Spin2, I made the interpreter so that it used almost none of these resources, so they'd be available to PASM code in other tasks.

I think a good way to minimize the need for these special C registers would be to use the existing LIFO resources via 2..3 instruction sequences that stay in cog RAM, so that C calls them, instead of emitting a lot of duplicate code in hub space. Is that reasonable to do, or does it complicate the way gcc wants to work?
