Hub Execution Model Thread (split from blog)

David Betz · 2013-12-10 03:44

cgracey wrote: »

I spent all day getting the PC mapped into $1F0/$1F1, only to realize that it extended the critical path of the whole chip and slowed it way down. The reason is that the computed ALU result is the last-arriving signal set, and to run it through a few more sets of muxes to accommodate the four task PCs, and then get it out to the cog RAM instruction address input, just takes too long. The only way to circumvent these delays is to add another pipeline stage, which will make cancelling branches take one more clock, and 4-way multitasking branches take two clocks, instead of one. It's not worth it. So, the PCs will have to be addressed by instructions, only, in which the PC result does not go through the main ALU. It was worth trying, though, because the benefits would have been great. I think to compensate, I'll make relative jumps, which are easy to implement without drawbacks. This will give us the same performance we would have had with mappable PCs, when it comes to adding to them.

Yes, it was worth trying. Sorry it didn't work out!

evanh · 2013-12-10 03:54

cgracey wrote: »

None of that is going away, after all.

Cool. Certainly is a fast moving topic.

Heater. · 2013-12-10 03:57

evanh,

I agree, stretching the hardware thread scheduling beyond the confines of a COG is not making much sense to me at the moment, especially if it has large repercussions elsewhere. Of course if it can be done easily, from both an implementation and user point of view, then it's a winner.

They are preset time-sliced and therefore of no benefit to normally prioritised multitasking

I don't agree.

It's that hub slot argument again. If one thread is not using the time then perhaps performance could be boosted by letting another thread use it. It requires more complicated scheduling even before we start to assign priorities to threads. I don't see it ever being possible given the pipeline we are working with.

I would suggest that 100% fixed round robin scheduling has benefits all of its own in terms of predictability and ease of programming. Never mind a little loss in efficiency. The hardware threads are already a lot more efficient than creating threads manually with JMPRET or TASKSWITCH like instructions. Nothing to complain about there.

evanh · 2013-12-10 04:07

Heater. wrote: »

The hardware threads are already a lot more efficient than creating threads manually with JMPRET or TASKSWITCH like instructions. Nothing to complain about there.

That efficiency advantage vanishes once the granularity is a suitably large size. Priorities become overwhelmingly more important in bloated^H^H^H^H^H^H^H larger code. But, yeah, I've got no problem having both methods available if it's not compromising other parts of the design.

EDIT: I was thinking in terms of speed of course. But in terms of code size, yielding is a wasteful multitasking mechanism. It's not the only other option here I hope.

Heater. · 2013-12-10 04:34

evanh,

There are three huge advantages to hardware thread scheduling:

1) Reduced code size. No TASKSWITCH instructions required. This becomes more important as the rate at which you need to switch between tasks increases, i.e. when tasks need low latency response to external events.

2) Exactly that reduced latency to external events. You don't have to wait for the other threads to hit a TASKSWITCH.

3) Overall increase in performance simply due to the reduced number of instructions required.

One could argue that priorities could be important even with small code. Imagine two tasks waiting on pins. Perhaps we would want one of them to react to its pin and complete its response within only four or five instructions, whilst the other has hundreds of instructions' time to work in. In that case priorities would allow the job to be done. That however is pushing things too far and I don't ever see it as being practical to implement.
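The latency argument in points 1 to 3 can be illustrated with a toy timeline model. This is only a sketch under stated assumptions: one instruction per clock, no pipeline or hub stalls, and a cooperative yield that costs one clock. The function names are hypothetical and do not correspond to anything in the chip or its tools.

```python
from itertools import cycle, islice


def round_robin_timeline(num_tasks, clocks):
    """Task id that owns each clock when slots rotate in a fixed order."""
    return list(islice(cycle(range(num_tasks)), clocks))


def cooperative_timeline(run_lengths, clocks, switch_cost=1):
    """Task id that owns each clock when task i runs run_lengths[i]
    instructions and then spends switch_cost clocks on a yield
    (a TASKSWITCH-like instruction; the cost of 1 is an assumption)."""
    timeline = []
    for task in cycle(range(len(run_lengths))):
        timeline.extend([task] * (run_lengths[task] + switch_cost))
        if len(timeline) >= clocks:
            return timeline[:clocks]


def worst_wait(timeline, task):
    """Longest run of consecutive clocks in which `task` does not execute."""
    worst = current = 0
    for owner in timeline:
        current = 0 if owner == task else current + 1
        worst = max(worst, current)
    return worst


if __name__ == "__main__":
    fixed = round_robin_timeline(4, 40)
    yielding = cooperative_timeline([10, 10, 10, 10], 200)
    print("fixed slots, worst wait:", worst_wait(fixed, 0))
    print("cooperative, worst wait:", worst_wait(yielding, 0))
```

In this model the wait under fixed slots depends only on the number of tasks, while the cooperative wait grows with how long the other tasks run between yields.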

evanh · 2013-12-10 04:47

Priorities at the low level require hardware support for very little gain. Much more can be gained by just doing the peripheral in hardware and buffering.

For multitasking, hardware time-slicing doesn't help simply because the granularity is coarse. And it usually gets in the way because if even one thread is sleeping then its time slots go unused.
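The point about unused slots can be put in numbers with a very small sketch. It assumes a sleeping task's slot is simply wasted under fixed slicing, and compares that with a hypothetical scheduler that hands idle slots to the tasks that are awake. Neither function reflects actual chip behaviour.

```python
from fractions import Fraction


def share_per_awake_task(awake, reclaim_idle_slots=False):
    """Fraction of all clocks that each awake task receives.

    awake: one boolean per hardware task.
    reclaim_idle_slots: False models fixed slicing (sleeping slots wasted),
    True models an idealised scheduler that redistributes them.
    """
    n_awake = sum(awake)
    if n_awake == 0:
        return Fraction(0)
    divisor = n_awake if reclaim_idle_slots else len(awake)
    return Fraction(1, divisor)


if __name__ == "__main__":
    tasks = [True, True, True, False]  # one of four tasks is sleeping
    print("fixed slicing:", share_per_awake_task(tasks))
    print("idle slots reclaimed:", share_per_awake_task(tasks, True))
```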

David Betz · 2013-12-10 04:50

cgracey wrote: »

I spent all day getting the PC mapped into $1F0/$1F1, only to realize that it extended the critical path of the whole chip and slowed it way down.

Dare I ask if this now means that one of those registers might be available for LR?

Heater. · 2013-12-10 05:10

evanh,

The attraction of hardware thread scheduling is in things like serial communications where basically you have an Rx and a Tx thread. To get maximum speed and reliability they both need to have short latencies: Rx to the incoming edges on a pin, Tx to the clock that is timing the output edges. The priorities are balanced. Both Tx and Rx have to be able to get the time they need when they need it. Removing the need for TASKSWITCH instructions increases the possible speed by removing code. Even if there is no Tx going on, say, there is no point in giving those free time slots to Rx because those slots have to be there if and when they are needed.

Priorities and complicated scheduling can increase the total compute capacity on average but are no good when you need guarantees of available cycles. It's the same issue one has with interrupts when pushed to the limit.

cgracey · 2013-12-10 05:14

David Betz wrote: »

Dare I ask if this now means that one of those registers might be available for LR?

Wait until you see how the hub execution works. You might deem LR superfluous.

David Betz · 2013-12-10 05:25

cgracey wrote: »

Wait until you see how the hub execution works. You might deem LR superfluous.

Okay, fair enough. I'll wait for your description of hub execution mode.

evanh · 2013-12-10 05:26

Heater. wrote: »

Priorities and complicated scheduling can increase the total compute capacity on average but are no good when you need guarantees of available cycles. It's the same issue one has with interrupts when pushed to the limit.

Doh! I'm clearly pointing out that the slicing is great for soft-peripherals. And I'm clearly pointing out that priorities are not good for this.

But, I'm also pointing out, independently, for kernel level multitasking, that the granularity is so coarse that time slicing no longer helps speed wise and will often slow things down and that priorities are what's important at this level.

Heater. · 2013-12-10 05:30

evanh,

OK, we're good to go.

Cluso99 · 2013-12-11 00:14

Where is HUBEXEC up to?? I have been away for a couple of days so my access has been infrequent.

Why does Hubexec require more than 1 PC ? I understand the requirement if it runs as a task, and that makes everything equal. But why do we need a 9-bit PC and a (16+2) 16-bit +00 PC? Surely, for cog mode, the upper bits are either 0's or ignored. Hub mode is not going to run from ROM so we can discount the lower 2KB where the cog would be mapped. Am I missing something?

Even the resulting S & D addresses could later be expanded to the 16+2 bits, and the pipeline could stall while it fetched the data from hub if the S and/or D address resulted in a hub address. But this would be a P3 implementation.

The idea of DJZ etc being relative jumps sounds nice. But if this changes to relative, then it should be relative for both Hubexec and Cog modes. This is really only a compiler issue anyway. It would actually permit the use of +/-255 relative jumps or with the use of the AUGS instruction this would increase to half the hub (or just +/- 512 cog using 1 bit - no reason to limit it to hub). In fact, why not make all the JMPRET/CALL/RET instructions relative, and then maybe just one non-delayed direct JMP (sort of a HJMP like Bill/David are discussing)?

Does any of this make sense? I just don't want to see massive instruction changes done at this late stage - I presume changing from absolute to relative would be quite simple.

Why do we need that many more full instructions? How many do we need?
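The reach of a relative jump follows from the width of its offset field. The sketch below assumes a 9-bit two's-complement offset counted in longs from the instruction after the branch, and a 16-bit PC. Both the encoding and the function names are assumptions for illustration, since the thread does not give the final instruction format.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def relative_target(pc, offset_field, bits=9, pc_bits=16):
    """Target of a relative branch located at `pc`.

    Assumes the offset is measured from the following instruction (pc + 1)
    and that the PC wraps at `pc_bits` bits.
    """
    return (pc + 1 + sign_extend(offset_field, bits)) & ((1 << pc_bits) - 1)


if __name__ == "__main__":
    print("9-bit range:", sign_extend(0x100, 9), "to", sign_extend(0x0FF, 9))
    print("jump to self:", hex(relative_target(0x1000, 0x1FF)))
    print("furthest forward:", hex(relative_target(0x1000, 0x0FF)))
```

A 9-bit signed field covers -256 to +255, which matches the "+/-255" figure in the post. A wider reach would need the extra bits from something like AUGS.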

cgracey · 2013-12-11 00:25

Cluso99 wrote: »

Where is HUBEXEC up to?? I have been away for a couple of days so my access has been infrequent.

Why does Hubexec require more than 1 PC ? I understand the requirement if it runs as a task, and that makes everything equal. But why do we need a 9-bit PC and a (16+2) 16-bit +00 PC? Surely, for cog mode, the upper bits are either 0's or ignored. Hub mode is not going to run from ROM so we can discount the lower 2KB where the cog would be mapped. Am I missing something?

Even the resulting S & D addresses could later be expanded to the 16+2 bits, and the pipeline could stall while it fetched the data from hub if the S and/or D address resulted in a hub address. But this would be a P3 implementation.

The idea of DJZ etc being relative jumps sounds nice. But if this changes to relative, then it should be relative for both Hubexec and Cog modes. This is really only a compiler issue anyway. It would actually permit the use of +/-255 relative jumps or with the use of the AUGS instruction this would increase to half the hub (or just +/- 512 cog using 1 bit - no reason to limit it to hub). In fact, why not make all the JMPRET/CALL/RET instructions relative, and then maybe just one non-delayed direct JMP (sort of a HJMP like Bill/David are discussing)?

Does any of this make sense? I just don't want to see massive instruction changes done at this late stage - I presume changing from absolute to relative would be quite simple.

Why do we need that many more full instructions? How many do we need?

I've got the master plan all worked out now for hub execution. It's taken a few days of thinking to get it all sorted out, but it turned out to be pretty simple, and the programmer will not be burdened with lots of strange branch instructions. Code that runs in the hub is written with the same branch instructions as code that runs in the cog. Each branch instruction has a special version which toggles between hub/cog mode for the branch destination. All stack saves incorporate this mode bit. RETurns restore the caller's hub/cog mode. So, you can call from hub to cog and vice-versa with the same set of instructions. It's really easy to use. When in hub mode, all those DJNZ/JP/etc. branches become relative, if immediate. The assembler will know by a directive whether it is assembling hub or cog code. The context between hub and cog modes is very fluid. I think everyone's going to be very happy.

All task PC's have been expanded from 9 to 16 bits, to accommodate hub addressing. It was natural to do it this way, since the PC's have all the pipeline mechanisms built into them. I even added a 4-bit field to JMPTASK's PC mask which will allow you to launch hub threads directly, without even starting them in the cog. It's going to be sheer simplicity.
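The call and return behaviour described here can be modelled in a few lines. This is a sketch of the idea only: the `Task` class, the software stack, and the rule that hub-mode offsets count from the following instruction are all assumptions, not the actual hardware design.

```python
from dataclasses import dataclass, field

COG, HUB = 0, 1
PC_MASK = 0xFFFF  # 16-bit task PC, as stated in the post


@dataclass
class Task:
    pc: int = 0
    mode: int = COG
    stack: list = field(default_factory=list)

    def branch_target(self, operand):
        """Immediate branch target: relative in hub mode, absolute in cog mode.
        Measuring the offset from pc + 1 is an assumption."""
        if self.mode == HUB:
            return (self.pc + 1 + operand) & PC_MASK
        return operand & PC_MASK

    def call(self, target, toggle_mode=False):
        """Save the return address together with the caller's mode bit,
        optionally toggle hub/cog mode, then branch."""
        self.stack.append(((self.pc + 1) & PC_MASK, self.mode))
        if toggle_mode:
            self.mode ^= 1
        self.pc = target & PC_MASK

    def ret(self):
        """Restore both the return address and the caller's mode."""
        self.pc, self.mode = self.stack.pop()


if __name__ == "__main__":
    task = Task(pc=0x2000, mode=HUB)
    task.call(0x040, toggle_mode=True)  # hub code calls a cog routine
    print(hex(task.pc), "cog" if task.mode == COG else "hub")
    task.ret()  # back in hub mode at the caller's next instruction
    print(hex(task.pc), "cog" if task.mode == COG else "hub")
```

Because the mode bit travels with the return address, the callee needs no knowledge of where it was called from, which is what lets one set of branch instructions serve both modes.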

ozpropdev · 2013-12-11 00:32

cgracey wrote: »

I've got the master plan all worked out now for hub execution. It's taken a few days of thinking to get it all sorted out, but it turned out to be pretty simple, and the programmer will not be burdened with lots of strange branch instructions. Code that runs in the hub is written with the same branch instructions as code that runs in the cog. Each branch instruction has a special version which toggles between hub/cog mode for the branch destination. All stack saves incorporate this mode bit. RETurns restore the caller's hub/cog mode. So, you can call from hub to cog and vice-versa with the same set of instructions. It's really easy to use. When in hub mode, all those DJNZ/JP/etc. branches become relative, if immediate. The assembler will know by a directive whether it is assembling hub or cog code. The context between hub and cog modes is very fluid. I think everyone's going to be very happy.

All task PC's have been expanded from 9 to 16 bits, to accommodate hub addressing. It was natural to do it this way, since the PC's have all the pipeline mechanisms built into them. I even added a 4-bit field to JMPTASK's PC mask which will allow you to launch hub threads directly, without even starting them in the cog. It's going to be sheer simplicity.

Sounds great Chip! Not scary at all...

potatohead · 2013-12-11 00:36

That depends on the goal.

Right now, that is roughly:

1. make sure big programs run at a very good speed relative to COG programs,

2. ensure compilers have the support needed to perform well and realize #1,

3. make sure programming in PASM is robust, powerful, fast,

4. address known use cases for #3.

Those things have all been articulated recently, and looking at those, a verbose instruction set best aligns with that goal. My read on things anyway.

In terms of a general case micro-controller, a much smaller, general purpose instruction set would more than suffice. However, we have a lot of known instruction use cases from P1 that can be addressed to improve PASM code size and speed, as well as some new ones identified based on chip capabilities and our FPGA testing.

Seems the balance is tilted toward more instructions, so long as they don't impose a timing limit for the chip overall and they are both flexible and contribute to those cases in terms of code size and speed. The best "bang for the gate" instructions appear to be operators and helper type instructions, which these are.

The pixel instruction Baggers asked for, for example, will mean that moving large objects, combining them, blending them, etc. gets a very serious performance boost. The Gray code instructions mean the same for encoders, for a parallel type of example. Both are real time type tasks.

If I were to step back and look at the instruction discussion overall, the common emphasis is on maximizing real time activity, minimizing code size, and maximizing compute and throughput per instruction.

Given that's mostly true, the real discussion here is whether or not the BCD encoder case is on par with the Gray code one. If it is, then it makes sense to do both. If not, then BCD isn't needed and software will do.

What I saw here was various cases discussed. The cash register type thing isn't significant to me at all. Make a small routine and do it. We've got big integers to scale, and we've got complex math in chip now too. None of that computation type argument made any real sense either.

About the only compelling thing I saw here was the case of encoders / sensors reporting in BCD, thus my comments above.

I'm not taking a position on these, because I really don't know whether or not the BCD encoder / sensor case is on par with Gray code. Just highlighting what I saw in the discussion and the dynamics as they are at present.
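For reference, the two conversions being weighed against each other are small in software. The sketch below shows the standard Gray code conversions and a packed-BCD decode. It is written in Python for clarity and says nothing about how many PASM instructions either would take, which is the real question in this discussion. Function names are illustrative.

```python
def binary_to_gray(n):
    """Standard reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def gray_to_binary(gray):
    """Inverse of binary_to_gray: XOR together all right shifts of the input."""
    binary = 0
    while gray:
        binary ^= gray
        gray >>= 1
    return binary


def bcd_to_binary(bcd):
    """Decode packed BCD (one decimal digit per 4-bit nibble)."""
    value, scale = 0, 1
    while bcd:
        digit = bcd & 0xF
        if digit > 9:
            raise ValueError("nibble is not a decimal digit")
        value += digit * scale
        scale *= 10
        bcd >>= 4
    return value


if __name__ == "__main__":
    print([binary_to_gray(n) for n in range(8)])
    print(gray_to_binary(binary_to_gray(200)))
    print(bcd_to_binary(0x1234))
```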

potatohead · 2013-12-11 00:42

Just saw Chip's post. Nice! If that is all fluid and simple, it's worth whatever instruction jiggling needs to be done.

I'm stoked about the HUBEXEC!

Frankly, this being fluid right now is precisely why I've not written too much code. Given the current development path, considering code written to date seems non-optimal given most of the code that will benefit is yet to be written at all!

jmg · 2013-12-11 01:21

potatohead wrote: »

Frankly, this being fluid right now is precisely why I've not written too much code. Given the current development path, considering code written to date seems non-optimal given most of the code that will benefit is yet to be written at all!

Even with a changing core:
a) Code written in an HLL will gain automatically as compilers improve
b) Hand-crafted library code gives a good starting point for further tuning, so is never really wasted.

jmg · 2013-12-11 01:28

cgracey wrote: »

Each branch instruction has a special version which toggles between hub/cog mode for the branch destination. All stack saves incorporate this mode bit. RETurns restore the caller's hub/cog mode. So, you can call from hub to cog and vice-versa with the same set of instructions. It's really easy to use. When in hub mode, all those DJNZ/JP/etc. branches become relative, if immediate. The assembler will know by a directive whether it is assembling hub or cog code. The context between hub and cog modes is very fluid. I think everyone's going to be very happy.

Sounds very good, and reminds me of the biggest clanger I saw Intel make.

They released the MCS-251, and it also had a mode-bit that allowed expanded reach and a choice of legacy or advanced operation.
What did they do wrong? Instead of making it run-time settable, with a copy in the stack, they set it in OTP memory. (?!)
That meant users could not take existing code and compile-in faster libraries, or manage a mix of in-line ASM gains.

They also mis-phased with the release of flash micros, and both combined to relegate the part to a footnote in history.

Cluso99 · 2013-12-11 02:29

cgracey wrote: »

I've got the master plan all worked out now for hub execution. It's taken a few days of thinking to get it all sorted out, but it turned out to be pretty simple, and the programmer will not be burdened with lots of strange branch instructions. Code that runs in the hub is written with the same branch instructions as code that runs in the cog. Each branch instruction has a special version which toggles between hub/cog mode for the branch destination. All stack saves incorporate this mode bit. RETurns restore the caller's hub/cog mode. So, you can call from hub to cog and vice-versa with the same set of instructions. It's really easy to use. When in hub mode, all those DJNZ/JP/etc. branches become relative, if immediate. The assembler will know by a directive whether it is assembling hub or cog code. The context between hub and cog modes is very fluid. I think everyone's going to be very happy.

All task PC's have been expanded from 9 to 16 bits, to accommodate hub addressing. It was natural to do it this way, since the PC's have all the pipeline mechanisms built into them. I even added a 4-bit field to JMPTASK's PC mask which will allow you to launch hub threads directly, without even starting them in the cog. It's going to be sheer simplicity.

Sounds fantastic. Can hardly wait to see what you have come up with.

evanh · 2013-12-11 03:02

cgracey wrote: »

All task PC's have been expanded from 9 to 16 bits, to accommodate hub addressing. It was natural to do it this way, since the PC's have all the pipeline mechanisms built into them. I even added a 4-bit field to JMPTASK's PC mask which will allow you to launch hub threads directly, without even starting them in the cog. It's going to be sheer simplicity.

Does that mean that COGINIT/COGNEW are all but defunct now?

David Betz · 2013-12-11 03:24

cgracey wrote: »

I've got the master plan all worked out now for hub execution. It's taken a few days of thinking to get it all sorted out, but it turned out to be pretty simple, and the programmer will not be burdened with lots of strange branch instructions. Code that runs in the hub is written with the same branch instructions as code that runs in the cog. Each branch instruction has a special version which toggles between hub/cog mode for the branch destination. All stack saves incorporate this mode bit. RETurns restore the caller's hub/cog mode. So, you can call from hub to cog and vice-versa with the same set of instructions. It's really easy to use. When in hub mode, all those DJNZ/JP/etc. branches become relative, if immediate. The assembler will know by a directive whether it is assembling hub or cog code. The context between hub and cog modes is very fluid. I think everyone's going to be very happy.

All task PC's have been expanded from 9 to 16 bits, to accommodate hub addressing. It was natural to do it this way, since the PC's have all the pipeline mechanisms built into them. I even added a 4-bit field to JMPTASK's PC mask which will allow you to launch hub threads directly, without even starting them in the cog. It's going to be sheer simplicity.

Sounds great! How do you handle CALL instructions executed in hub mode? What 9-bit value gets stored in the RET instruction to allow the COG function to return to the correct hub address?

cgracey · 2013-12-11 03:24

evanh wrote: »

Does that mean that COGINIT/COGNEW are all but defunct now?

No. You need those to start the cog. Then, the cog can start any extra tasks, or execute alone, either in the cog or the hub.

evanh · 2013-12-11 03:56

cgracey wrote: »

No. You need those to start the cog. Then, the cog can start any extra tasks, or execute alone, either in the cog or the hub.

I'll refine that a little ... what if the start address was a direct hubexec entry point rather than a block copy into a cog? Dunno, I guess a fast block copy is still a desirable feature in the end, isn't it.

cgracey · 2013-12-11 04:03

evanh wrote: »

I'll refine that a little ... what if the start address was a direct hubexec entry point rather than a block copy into a cog? Dunno, I guess a fast block copy is still a desirable feature in the end, isn't it.

That's an interesting idea - just start a cog with nothing but a hub address. After I get hub execution going, I'll look into this. Right now, I just finished the new instruction mapping, so I need to implement it in the cog and PNut.exe.

Seairth · 2013-12-11 04:08

evanh wrote: »

I'll refine that a little ... what if the start address was a direct hubexec entry point rather than a block copy into a cog? Dunno, I guess a fast block copy is still a desirable feature in the end, isn't it.

I like that idea! If you really want to switch to cog execution mode, it would only be a REPS with a RDQUAD (I think). That block copy is the same speed as the current (implicit) block copy anyhow, so this wouldn't be any slower. In fact, it could be even faster, since you wouldn't necessarily need to copy a full 512 longs, depending on the actual size of your cog code.
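The saving from copying only the code that is needed is easy to estimate. The sketch assumes each RDQUAD transfers four longs and that a cog holds 512 longs. It counts quad reads only and makes no claim about clocks per read or REPS overhead.

```python
def quad_reads_needed(code_longs, longs_per_quad=4, cog_longs=512):
    """Number of quad reads required to load `code_longs` longs into the cog."""
    if not 0 <= code_longs <= cog_longs:
        raise ValueError("code does not fit in cog RAM")
    return -(-code_longs // longs_per_quad)  # ceiling division


if __name__ == "__main__":
    print("full image:", quad_reads_needed(512))
    print("100-long program:", quad_reads_needed(100))
```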

Cluso99 · 2013-12-11 06:26

Chip, just a thought. What if the D and S addresses in the pipeline were also expanded to 16+00 bits. Then each standard instruction (AND/XOR/etc) could work where addresses $000-1FF were cog, $200-2FF were aux, and $300-3FFFF were hub. The pipe would need to stall until hub was fetched. The AUGS/AUGD could extend immediate addresses.
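A decoder for such a linear address map might look like the sketch below. It assumes aux occupies $200-$2FF so that it does not overlap the hub range starting at $300, and that only hub accesses would stall the pipeline. The region names and the `decode` function are hypothetical.

```python
COG_END = 0x1FF   # cog registers: $000-$1FF
AUX_END = 0x2FF   # aux RAM: $200-$2FF (assumed upper bound)
HUB_END = 0x3FFFF  # hub: $300-$3FFFF, per the post


def decode(addr):
    """Map a unified S/D address to (region, offset within that region).

    Hub addresses are returned unchanged because the proposal maps hub
    memory directly above the cog and aux windows.
    """
    if not 0 <= addr <= HUB_END:
        raise ValueError("address out of range")
    if addr <= COG_END:
        return ("cog", addr)
    if addr <= AUX_END:
        return ("aux", addr - (COG_END + 1))
    return ("hub", addr)


def needs_stall(addr):
    """Only hub operands would require the pipeline to wait (assumption)."""
    return decode(addr)[0] == "hub"


if __name__ == "__main__":
    for a in (0x1FF, 0x200, 0x2FF, 0x300):
        print(hex(a), decode(a), needs_stall(a))
```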

David Betz · 2013-12-11 06:34

Cluso99 wrote: »

Chip, just a thought. What if the D and S addresses in the pipeline were also expanded to 16+00 bits. Then each standard instruction (AND/XOR/etc) could work where addresses $000-1FF were cog, $200-2FF were aux, and $300-3FFFF were hub. The pipe would need to stall until hub was fetched. The AUGS/AUGD could extend immediate addresses.

I tried to suggest this for COG and hub addresses but it sounds like it's hard to implement. It would be nice to have a linear address space though that includes both COG and hub addresses as well as possibly AUX addresses as you suggest.

Seairth · 2013-12-11 07:26

Cluso99 wrote: »

Chip, just a thought. What if the D and S addresses in the pipeline were also expanded to 16+00 bits. Then each standard instruction (AND/XOR/etc) could work where addresses $000-1FF were cog, $200-2FF were aux, and $300-3FFFF were hub. The pipe would need to stall until hub was fetched. The AUGS/AUGD could extend immediate addresses.

That would then mean that all hub instructions (which reference hub memory) would effectively become 64-bit instructions?

Seairth · 2013-12-11 07:34

Cluso99 wrote: »

Chip, just a thought. What if the D and S addresses in the pipeline were also expanded to 16+00 bits. Then each standard instruction (AND/XOR/etc) could work where addresses $000-1FF were cog, $200-2FF were aux, and $300-3FFFF were hub. The pipe would need to stall until hub was fetched. The AUGS/AUGD could extend immediate addresses.

Also, how do you access the ROM (HMAC, monitor, etc.)?
