
Propeller II update - BLOG


Comments

  • Bill Henning Posts: 6,445
    edited 2013-12-14 09:31
    Depends on how many transistors you have to spare :)

    If I remember the previous post correctly:

    - there are 4 lines of 8 longs for the cog; if hubexec is run in all four tasks, that effectively becomes 1 line of 8 longs per task (the 'i' instruction cache), which is OK, especially if the auto-prefetch is per task
    - there is 1 line of 8 longs for the cog for the RDxxxC instructions, shared by the four tasks (the 'd' data cache)

    If my understanding above is correct, then:

    - having a separate 8-long RDxxxC line per task (instead of one line shared between the four tasks) will help tremendously, as otherwise the tasks will tend to flush it on every access

    We would get more bang for the buck (i.e., transistors) from bigger caches than from a bigger LIFO, and I'd really hate to see the hub shrink below 256KB.

    In the future (P3?)

    - 8 or more lines of code cache per cog will be a nice performance boost
    - 4 or more lines of data cache per cog will also be a nice performance boost

    Now of course it is not worth increasing the cache to the point where a cache access takes more than one cycle.
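    To summarize the geometry as I understand it (same numbers as above, so treat them as provisional):

    I-cache: 4 lines x 8 longs = 32 longs per cog; with 4 hubexec tasks -> 1 line (8 longs) each
    D-cache: 1 line x 8 longs (RDxxxC), shared by all 4 tasks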
    cgracey wrote: »
    You mean for RDBYTEC/RDWORDC/RDLONGC, right? And you're talking about more 8-long cache lines than just one?
  • cgracey Posts: 14,133
    edited 2013-12-14 09:40
    Thanks for the cache advice, Bill. We'll see where we're at in terms of area when we synthesize again. I could see this helping a lot if all tasks were reading hub memory.
  • Bill Henning Posts: 6,445
    edited 2013-12-14 09:40
    Just use CALLA, which automatically pushes the return address onto a hub stack, taking one less instruction than explicitly pushing LR before a call.

    For leaf functions, just use CALL (which pushes onto the internal LIFO) and the corresponding return.

    For a dispatch table:

    POP     reg             ' return address (from the LIFO) -> reg
    ADD     reg,#offset     ' add the table offset

    instead of

    MOV     reg,#offset
    ADD     reg,LR          ' LR holds the return address

    Using LR, and explicitly pushing it, in GCC code is now counterproductive, as the CALLA/CALLB hub variants push the return address themselves, saving 1 long on every CALL.

    For leaf functions, just use CALLX/CALLY/CALL instead of CALLA/B; then there is no need for an explicit LR, and the extra hub cycle is still avoided.
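    Roughly, side by side (a sketch using the mnemonics discussed in this thread -- the instruction set is still in flux, and the labels are made up):

            CALLA   #non_leaf       ' return address pushed to hub stack automatically
    non_leaf
            ...                     ' free to make further CALLAs, no LR to save
            RETA                    ' pops the return address from the hub stack

            CALL    #leaf           ' return address pushed to the per-task LIFO
    leaf
            ...
            RET                     ' pops the LIFO -- no hub access at all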
    David Betz wrote: »
    My guess is that, if there is a return LIFO for the CALL instruction, every non-leaf C function will immediately pop the return address off that stack and push it onto the hub stack, adding one extra instruction to the function prologue. Leaf functions could leave the return address on the stack. Either that, or we'll end up using the hub stack call functions for everything. There is a big advantage to having a standard calling convention for all C functions, so we can't easily use one call instruction for leaf functions and a different one for non-leaf functions. I hope Eric will correct me if I'm wrong here.

    Also, I am aware that there would need to be four copies of the LR register. Maybe that's a deal breaker.

    Sorry I can't be more responsive this weekend!
  • Bill Henning Posts: 6,445
    edited 2013-12-14 09:43
    cgracey wrote: »
    Would there be any value in a 'PEEK D' instruction that returns the last-written LIFO stack value without popping it? That would be mindless to add.

    Yes, that would be helpful, as otherwise the value has to be POP'ed and PUSH'd to not disturb the LIFO stack.
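    For example (hypothetical syntax, since the instruction hasn't been added yet):

            POP     reg             ' today: pull the top LIFO entry off to inspect it...
            PUSH    reg             ' ...then push it back so the stack is undisturbed

            PEEK    reg             ' proposed: copy the top entry without popping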
  • Bill Henning Posts: 6,445
    edited 2013-12-14 09:49
    You are most welcome!

    Personally, I am most likely to use hubexec on a per-cog (instead of per-task) basis to avoid cache thrashing, which would slow the hubexec tasks down enormously, and I did not think we had the transistor budget to increase the number of I & D cache lines available.
    cgracey wrote: »
    Thanks for the cache advice, Bill. We'll see where we're at in terms of area when we synthesize again. I could see this helping a lot if all tasks were reading hub memory.
  • jmg Posts: 15,155
    edited 2013-12-14 11:39
    cgracey wrote: »
    ..
    When we synthesize the logic block, if we have room, I'll increase these stack depths from 4 to 8. I think 4 is actually quite adequate for internal cog use, and every task has a set. The old 'CALL #label' and 'label_ret RET' pairings are history. This makes code cleaner to write and read, plus routines become reentrant, which isn't so practical for recursion, but it means tasks can independently call the same routine now without return-addresses getting over-written.

    4 is sounding ok, and certainly shared routines will help make use of the limited COG memory.
    I think those shared routines will have restrictions on what they can do, as the phase of a call between tasks can be anything at all?
  • David Betz Posts: 14,511
    edited 2013-12-14 11:49
    Just use CALLA, which automatically pushes the return address onto a hub stack, taking one less instruction than explicitly pushing LR before a call.

    For leaf functions, just use CALL (which pushes onto the internal LIFO) and the corresponding return.
    If you're calling a function through a pointer you don't know if it is a leaf function or not. If you're calling a function in a separately compiled module the same is true. The GCC code generator generally uses the same calling convention for all functions to handle these issues. This means that we'll probably use the hub stack even for leaf functions with the corresponding decrease in performance because of two hub accesses even in leaf functions. The LR version of CALL gets around this problem but apparently isn't possible for P2. Maybe we can add it to the wish list for P3...
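    To make the cost concrete (a sketch using the mnemonics from this thread; the leaf body and names are made up):

            CALLA   #negate         ' hub access #1: push the return address
    negate  NEG     r0,r0           ' the actual work is a single instruction
            RETA                    ' hub access #2: pop the return address

    An LR-style CALL would let the leaf return through the link register and skip the hub entirely, which is exactly the case P2 apparently can't cover.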
  • David Betz Posts: 14,511
    edited 2013-12-14 11:50
    Yes, that would be helpful, as otherwise the value has to be POP'ed and PUSH'd to not disturb the LIFO stack.
    The GCC code generator will likely not use the LIFO stack, so I guess I shouldn't be commenting on what would be useful and what would not.
  • jmg Posts: 15,155
    edited 2013-12-14 11:53
    tonyp12 wrote: »
    So many tricks and workarounds are needed due to the 32-bit limit.
    Is the next step to 64-bit really needed? 36-bit or 48-bit, though it sounds weird, would be just enough to clean up the instruction ops and give 1024 cog longs and also 10-12 bit direct values, etc.

    http://en.wikipedia.org/wiki/36-bit

    Technically, yes, the opcode size can grow by any amount; synthesis does not care, it is just a number.
    I believe some blocks are custom cells, and byte-wide, so that rather quantizes opcode steps to 32, 40, 48, 56, 64.
    Quad/oct fetches from HUB memory then need a little care, as you want to ensure full opcodes load, but you also do not want the 'left over' piece overwriting used memory. Expanding HUB memory to the same width costs quite a lot of silicon, and makes data handling harder. So things get compromised somewhere...
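    For example, the 'left over' problem against the 256-bit-wide hub fetches being discussed:

    256 bits / 32-bit opcodes = 8 whole opcodes, 0 bits left over
    256 bits / 36-bit opcodes = 7 whole opcodes, 4 bits left over
    256 bits / 40-bit opcodes = 6 whole opcodes, 16 bits left over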
  • potatohead Posts: 10,260
    edited 2013-12-14 12:13
    I second this. Overall, having that stream of code maximized, even if it needs to be redirected to handle a few things, is worth more than the tasking is in this context.
    You are most welcome!

    Personally, I am most likely to use hubexec on a per-cog (instead of per-task) basis to avoid cache thrashing, which would slow the hubexec tasks down enormously, and I did not think we had the transistor budget to increase the number of I & D cache lines available.
  • Seairth Posts: 2,474
    edited 2013-12-14 13:06
    Could someone give an overview of the currently planned approach to the hub cache lines and how they will work with the hub execution mode? There's been so much discussion on the topic that I can no longer tell which ideas have been discarded and which have not.
  • Cluso99 Posts: 18,069
    edited 2013-12-14 13:21
    Chip,
    If multitasking is not used, would those extra "blocks" (LIFOs, etc) be usable by the one task?
  • jazzed Posts: 11,803
    edited 2013-12-14 14:30
    cgracey wrote: »
    Thinking more about this, just writing to $1F1 wouldn't be adequate for more than one hub task. The new CALL/POP combo would work better because it pulls from the task's own LIFO stack.
    I really don't see why we need more than one HUB task. Can you just save that for the COG stuff?
  • jmg Posts: 15,155
    edited 2013-12-14 14:54
    jazzed wrote: »
    I really don't see why we need more than one HUB task. Can you just save that for the COG stuff?

    I can see a use case for two (like the classic one-reads-the-other-writes), but if the silicon cost is significant, or high enough to compromise 2 well-resourced access paths into 8 marginal ones, then the upper number of supported HUB tasks needs to be carefully looked at.
  • potatohead Posts: 10,260
    edited 2013-12-14 15:01
    I think this is about looking at THE use case. Maximize that, leaving tasks for COG mode.
  • Bill Henning Posts: 6,445
    edited 2013-12-14 15:13
    Ok, in that case just use CALLA even for leaf functions.

    That removes a lot of complexity, and saves a long for every call. Probably a more significant win than avoiding a hub hit on a leaf function.
    David Betz wrote: »
    If you're calling a function through a pointer you don't know if it is a leaf function or not. If you're calling a function in a separately compiled module the same is true. The GCC code generator generally uses the same calling convention for all functions to handle these issues. This means that we'll probably use the hub stack even for leaf functions with the corresponding decrease in performance because of two hub accesses even in leaf functions. The LR version of CALL gets around this problem but apparently isn't possible for P2. Maybe we can add it to the wish list for P3...
  • cgracey Posts: 14,133
    edited 2013-12-14 16:53
    It would be a mess to start making the four sets of task hardware different in capability. It's easier to give them all the same functionality and there's no significant cost in doing so. The programmer can decide how to use them.
  • evanh Posts: 15,341
    edited 2013-12-14 16:59
    I see one use for the hardware threading combined with hubexec: when the hubexec code is acting as the manager/driver for the soft peripheral(s) that reside within the Cog, i.e., only one thread using the instruction cache. Beyond that, ordinary prioritised tasks should be used, for which the time-sliced hardware threading is of no value.

    In other words, I don't think anything beyond minimum resources should be allocated to fitting fine grained time-sliced multitasking into hubexec.
  • evanh Posts: 15,341
    edited 2013-12-14 17:01
    cgracey wrote: »
    It's easier to give them all the same functionality and there's no significant cost in doing so.

    Oh? It sounded like it was becoming a real cost. If you're happy then so am I.
  • potatohead Posts: 10,260
    edited 2013-12-14 17:16
    Agreed. I too was under that impression.
  • Yanomani Posts: 1,524
    edited 2013-12-14 17:55
    cgracey wrote: »
    It would be a mess to start making the four sets of task hardware different in capability. It's easier to give them all the same functionality and there's no significant cost in doing so. The programmer can decide how to use them.

    It's becoming truly difficult to decide which of the current threads fits best when it comes to selecting a single one to post some specific questions in.

    As for the currently envisioned implementation, is there any solid reason to avoid saving the two hardware task-selector bits on the stack, beyond the bits already scheduled to be saved?

    IMHO, I can see some nice statistics about usage frequency, causes of stack overflow (easy detection of infinite-loop errors), and even task-redistribution schedules between Cogs.

    Naturally, popping it seems to be useless, at least to me.

    Perhaps someone else could find better ways to use it, or could argue that it offers no advantages.

    Yanomani

    P.S. By popping, I meant popping into the task-selector bits. Knowing from where I was called seems to be perfectly valid. At least to me.
  • jazzed Posts: 11,803
    edited 2013-12-14 19:06
    cgracey wrote: »
    It would be a mess to start making the four sets of task hardware different in capability. It's easier to give them all the same functionality and there's no significant cost in doing so. The programmer can decide how to use them.
    That's fine. Since it's built-in I suppose it doesn't matter. As far as the GCC code generator goes, I trust that Eric will make the best of the available resources.

    Since you appear to be closer to a working implementation, do you have any idea of the performance improvement that just single-thread HUB-Fetch-Execute might offer over using the COG-Fetch-Execute LMM VM?
  • David Betz Posts: 14,511
    edited 2013-12-14 19:26
    Ok, in that case just use CALLA even for leaf functions.

    That removes a lot of complexity, and saves a long for every call. Probably a more significant win than avoiding a hub hit on a leaf function.
    Edit: Deleted because it was hopeless.
  • cgracey Posts: 14,133
    edited 2013-12-14 20:49
    David Betz wrote: »
    Frankly, I don't see how adding a FIFO could be easier than adding an instruction that sets a COG register to the return address. After all, that's pretty much what JMPRET already does except this instruction would have to save the full 16 bit PC. And using register mapping to avoid needing multiple copies of LR would solve that problem as well. I find it strange that we started talking about CALL_LR, Chip then proposed a FIFO-based CALL instruction and asked if that would do just as well. I replied that it wouldn't work as well but now we're getting it anyway.

    David, I'll try to work this LR issue out to your satisfaction.

    I'm thinking that we could force the D address to $000..$003, by task, in stage 4 of the pipeline, and send the return address down the result bus to be written. Then, you could use register remapping to access the LR at $000, regardless of task. I need to investigate what ramifications this will have on the data-forwarding circuitry in the pipeline. This mode will have to be enabled via some instruction, if it gets implemented.
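    If that gets implemented, usage might look something like this (purely hypothetical -- the enabling instruction doesn't exist yet, and its mnemonic is made up):

            SETLRW                  ' made-up mnemonic: enable return-address write-back
            CALL    #leaf           ' return address also lands in this task's $000 (LR)
    leaf
            ...
            JMP     LR              ' return via the register -- no LIFO, no hub stack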
  • cgracey Posts: 14,133
    edited 2013-12-14 20:52
    jazzed wrote: »
    That's fine. Since it's built-in I suppose it doesn't matter. As far as the GCC code generator goes, I trust that Eric will make the best of the available resources.

    Since you appear to be closer to a working implementation, do you have any idea of the performance improvement that just single-thread HUB-Fetch-Execute might offer over using the COG-Fetch-Execute LMM VM?

    I imagine it will be ~80% of native cog-execute speed.
  • Cluso99 Posts: 18,069
    edited 2013-12-14 21:48
    cgracey wrote: »
    I imagine it will be ~80% of native cog-execute speed.
    That would just be absolutely fantastic!!!
  • Heater. Posts: 21,230
    edited 2013-12-15 01:26
    Chip,
    I imagine it will be ~80% of native cog-execute speed.
    I have not been following the "execute directly from HUB" developments very closely but how is this even possible?

    We have eight processors sharing one RAM, which normally results in one-eighth memory bandwidth for each.

    The wide HUB access instructions can restore that memory bandwidth to COG access speeds.

    And then there is some talk of caching, which I did not get at all.

    BUT most code has a lot of control-transfer instructions: conditionals, loops, calls, returns. On top of that, data can be randomly distributed around memory.

    At that point any buffering or caching becomes less effective.

    Under what conditions would one get 80% of native speed? I would imagine it requires rather unusual straight line code and the "working set" of data to be in COG registers.
  • evanh Posts: 15,341
    edited 2013-12-15 02:40
    Heater. wrote: »
    Under what conditions would one get 80% of native speed? I would imagine it requires rather unusual straight line code and the "working set" of data to be in COG registers.

    Yep: under conditions of mostly sequential execution, with loops taken more often than not, and with the hugely boosted bandwidth of HubRAM in the P2. There is a 256-bit-wide hub databus now, which can in theory sustain random-access wide reads at about 6 GB/s. And since it is SRAM, there aren't any long-winded setup times. It is fully capable of feeding every Cog at full instruction rate, and then some, I think. That said, there is potential for a branch stall to extend out to a whole hub rotation.

    What happens with data accesses back to hub I'm not sure, but the extra bandwidth that came with the 256KB might be one reason Chip is indicating so high a figure. Instruction fetching now only needs half the full bandwidth, I think.
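    The rough arithmetic behind that figure, assuming a 200 MHz clock (my assumption, just the target being bandied about) and one 256-bit hub transfer per clock:

    256 bits = 32 bytes per transfer
    32 B x 200 MHz = 6.4 GB/s aggregate
    per cog (one slot in eight): 32 B / 8 clocks = 4 B/clock = one long per clock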
  • evanh Posts: 15,341
    edited 2013-12-15 02:48
    I guess there is opportunity to wide align the code for optimal performance.
  • David Betz Posts: 14,511
    edited 2013-12-15 03:48
    cgracey wrote: »
    David, I'll try to work this LR issue out to your satisfaction.

    I'm thinking that we could force the D address to $000..$003, by task, in stage 4 of the pipeline, and send the return address down the result bus to be written. Then, you could use register remapping to access the LR at $000, regardless of task. I need to investigate what ramifications this will have on the data-forwarding circuitry in the pipeline. This mode will have to be enabled via some instruction, if it gets implemented.
    It sounds like this is going to be a lot of trouble for you. I guess I thought it was mostly already implemented, because JMPRET already knows how to write its return address to a COG register, and it would simply be a matter of writing the full 16-bit PC into all 32 bits of the register rather than the 9-bit PC in the S field. If it's going to take major reworking of instruction decoding and the pipeline, then it probably isn't worth the risk. Thanks for offering, though.

    As I mentioned in an earlier message, PropGCC will probably end up using CALLA/RETA for calling all functions, with a two-hub-access hit on leaf functions. Unfortunately, leaf functions are where you really want good performance, since they tend to be called in inner loops. I'm kind of surprised that Bill seems to think the two hub accesses are no big deal when he is at the same time asking for PEEK for the new return stack. The extra PUSH that would be required in the absence of PEEK would only add a single instruction time, whereas the two hub accesses for leaf functions could add up to 16 instruction times.
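    For reference, the arithmetic behind those figures (assuming the 8-cog hub rotation, with roughly one instruction time per clock):

    missing PEEK:  one extra PUSH                    = ~1 instruction time
    CALLA + RETA:  2 hub accesses, each of which
                   may wait out a full rotation      = up to 2 x 8 = 16 instruction times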