Hub Execution Model Thread (split from blog)

David Betz · 2013-12-16 20:09

cgracey wrote: »

You don't really need any PTR register if you just maintain a regular RAM register as a stack pointer. I would think that C is not likely to do more than push/pop, anyway, and that is just a WRxxxx/RDxxxx instruction with an ADD/SUB instruction after or before it to update the address. PASM is likely to get lots of use out of the PTRs, though.

Dave, have you thought about not using the PTR registers? I think the benefit to C is very marginal, but they are significant to PASM programs that might be running in other tasks. In Spin2, I made the interpreter so that it used almost none of these resources, so they'd be available to PASM code in other tasks.

I think a good way to minimize the need for these special C registers would be to use the existing resources via 2..3 instruction sequences that stay in cog RAM, so that C calls them, instead of emitting a lot of duplicate code in hub space.

I would think that the indexed PTRx modes would be useful for accessing local stack variables. However, GCC passes most parameters in registers so that might be of minimal value. We'll probably not use the PTRx registers at first since the current code generator doesn't know about them but it might be worth investigating if they can be used to speed up hub stack access at some point.
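
The push/pop sequence Chip describes in the quote above (a hub write or read, with an ADD/SUB on an ordinary register before or after it) can be modelled roughly as follows. This is only a sketch: the class name `HubStack`, the hub size, and the upward-growing stack direction are assumptions made for illustration, not details of the P2 or of GCC's calling convention.

```python
import struct

HUB_SIZE = 1024  # hypothetical hub RAM size, chosen only for this model


class HubStack:
    """Toy model of a stack in hub RAM with a plain register as stack pointer."""

    def __init__(self, base):
        self.hub = bytearray(HUB_SIZE)
        self.sp = base  # an ordinary cog register, not PTRA/PTRB

    def push(self, value):
        # Models "write long at sp", followed by "add 4 to sp"
        struct.pack_into("<I", self.hub, self.sp, value & 0xFFFFFFFF)
        self.sp += 4

    def pop(self):
        # Models "subtract 4 from sp", followed by "read long at sp"
        self.sp -= 4
        return struct.unpack_from("<I", self.hub, self.sp)[0]


stack = HubStack(base=0x100)
stack.push(0x11111111)
stack.push(0x22222222)
top = stack.pop()
```

Each push or pop costs two steps here, which is the trade-off under discussion: one extra instruction per access in exchange for leaving the PTR registers free for PASM tasks.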

cgracey · 2013-12-16 20:12

David Betz wrote: »

I would think that the indexed PTRx modes would be useful for accessing local stack variables. However, GCC passes most parameters in registers so that might be of minimal value. We'll probably not use the PTRx registers at first since the current code generator doesn't know about them but it might be worth investigating if they can be used to speed up hub stack access at some point.

I suspect that because the offsets in PTRx instructions must be hard-coded into the instructions, they might be of no use. If that's the case, it would be cool to eliminate their use in C to render the PTR resources available to PASM code in other tasks. Also, and this is important: it allows 4 hub tasks running C.

As I worked on Spin2, I came to the realization that it was best to tread lightly on resource requirements, to maximize what PASM would be able to accomplish in other tasks using those resources. What runs in PASM tasks will be hand-crafted code that is able to really get maximum benefit from those resources, where they're only of marginal value to high-level code.

The LIFOs are great task-specific resources that needn't be conserved, though. I wish GCC would let you use these things more easily. I hate the idea of making hardware bow to software limitations.

David Betz · 2013-12-16 20:16

cgracey wrote: »

I suspect that because the offsets in PTRx instructions must be hard-coded into the instructions, they might be of no use.

I'm not sure that's true. The offsets into the stack frame are known at compile time and can be constants.
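
A minimal sketch of that point: a code generator lays out the locals of a function once, so every access can carry its offset as a literal. The helper names `assign_frame_offsets` and `emit_load`, the tuple form of the emitted "instruction", and the 4-byte slot size are all hypothetical; this is not how GCC or the P2 encodes anything.

```python
def assign_frame_offsets(local_names, word_size=4):
    """Give each local a fixed byte offset from the frame base."""
    return {name: i * word_size for i, name in enumerate(local_names)}


def emit_load(offsets, name, dest_reg):
    """Emit an abstract load whose offset is a constant fixed at compile time."""
    return ("load_long", dest_reg, "frame_base", offsets[name])


# Layout is decided while compiling the function, before any code runs.
offsets = assign_frame_offsets(["x", "y", "sum"])
code = [
    emit_load(offsets, "y", "r0"),
    emit_load(offsets, "sum", "r1"),
]
```

Whether those constants fit in the offset field of a PTRx instruction is a separate question that this sketch does not address.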

jmg · 2013-12-16 20:27

David Betz wrote: »

...However, that assumes that Chip will provide separate copies of PTRA and LR for each task and I'm not sure that is planned. If not then SP will probably have to be a pseudo register and LR may need to occupy a COG register as well. ...

Will Chip provide separate copies of PTRA and LR for each task?

Where resources do not have a per-task copy, what happens when another task tries to use such a resource?
Is it locked out, or is it open slather, where anyone who wants to can grab the single-copy resource?
The code management issues here could get tricky.

jazzed · 2013-12-16 20:34

jmg wrote: »

The code management issues here could get tricky.

Yup. If we need more threads, we can start new COGs.

Apparently Chip was looking for more symmetry with regard to HUB.

I'm afraid that old man Occam would be trimming HUBEXEC out completely by now.

cgracey · 2013-12-16 20:41

jmg wrote: »

Will Chip provide separate copies of PTRA and LR for each task?

Where resources do not have a per-task copy, what happens when another task tries to use such a resource?
Is it locked out, or is it open slather, where anyone who wants to can grab the single-copy resource?
The code management issues here could get tricky.

It's a free-for-all. Anyone can use anything.

If register remapping is enabled and each C task is given 32 unique registers, all accessible from 0..31, four hub exec tasks would have total symmetry. They'd need to mind their own stack pointers, and the CALL/RET LIFOs could provide LR-type functionality. There could be some static code at 128 (above 32 registers x 4 blocks/tasks) containing housekeeping routines that would execute without caching limitations, even allowing the compiled C in hub memory to be very compact data, for some purposes, and not verbose code. I say, if we're going to have C, let it run in every task, not just one.

cgracey · 2013-12-16 20:53

In light of the 4-block register remapping, I think I'll implement the LR at $000, causing it to spread to other physical locations, according to task, but always task-accessible at $000. I think having the potential for 4 identical C tasks via hub exec is really neat.

David Betz · 2013-12-16 20:56

cgracey wrote: »

In light of the 4-block register remapping, I think I'll implement the LR at $000, causing it to spread to other physical locations, according to task, but always task-accessible at $000. I think having the potential for 4 identical C tasks via hub exec is really neat.

Sounds good. Thanks Chip!

jmg · 2013-12-16 20:58

cgracey wrote: »

It's a free-for-all. Anyone can use anything.

If register remapping is enabled and each C task is given 32 unique registers, all accessible from 0..31, four hub exec tasks would have total symmetry. They'd need to mind their own stack pointers, and the CALL/RET LIFOs could provide LR-type functionality. There could be some static code at 128 (above 32 registers x 4 blocks/tasks) containing housekeeping routines that would execute without caching limitations, even allowing the compiled C in hub memory to be very compact data, for some purposes, and not verbose code. I say, if we're going to have C, let it run in every task, not just one.

Sounds good. How much of that is in the current core, and how much is aspirational or coming?
The above sounds both easy to explain and easy to understand.

The worst nightmare would be some one-off resource that might be clobbered rarely, with no way of checking that no one else is trying to use it. Some uCs have an exchange opcode, which gives an atomic swap of 2 locations, but with tasks that is not enough protection, as an atomic operation is not possible.

jmg · 2013-12-16 21:01

cgracey wrote: »

In light of the 4-block register remapping, I think I'll implement the LR at $000, causing it to spread to other physical locations, according to task, but always task-accessible at $000. I think having the potential for 4 identical C tasks via hub exec is really neat.

That sounds inherently safe? And safe appeals much more than neat.

It also sounds flexible. Someone may use an expanded hub-exec task during debug, then disable all the checks/reporting and thus shrink it so that it always fits in a portion of the COG.

potatohead · 2013-12-16 21:10

I'm kind of laughing.

Not in a bad way. It's just really different. I think we are going to have a fine time exploiting this thing, and I think it's going to take a while too. Once that is all done, we are going to know how to just nail P3.

I would not pull HUBEXEC at this point. It's mostly done, and I really do think the priority should be maximizing it on one task so that we get fast, larger programs from PASM and GCC. Anything beyond that is fun 'n games as far as I am concerned. And who knows? As it gets explored, we may well find highly optimized cases make a lot of sense.

In this, I agree with Chip. We know the one task case is needed, and it should perform well. Leaving it open lets us exploit the hardware, again probably learning just how a P3 will sing.

cgracey · 2013-12-16 21:12

jmg wrote: »

Sounds good. How much of that is in the current core, and how much is aspirational or coming?
The above sounds both easy to explain and easy to understand.

I'd just need to implement the LR at $000. Register remapping has been part of the architecture for a long time. And it is really simple. You specify the block size as some power of two, and in the case of task-based remapping, you get four blocks of the requested size, arranged sequentially in cog RAM, but the first block's addresses spread out to the other blocks according to task. This, with LR at $000, provides the complete register context that C seems to want.
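
The task-based remapping described above can be sketched as a small address-translation function. This is a model under stated assumptions, not the actual hardware logic: the function name `remap` is made up, the four blocks are assumed to start at cog address 0, and addresses outside the first block are assumed to pass through unchanged.

```python
def remap(addr, task, block_size):
    """Map a task-visible cog address to a physical cog address.

    Assumes four blocks of block_size registers laid out back to back
    from address 0, with only the first block's addresses redirected.
    """
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError("block_size must be a power of two")
    if not 0 <= task < 4:
        raise ValueError("task must be 0..3")
    if addr < block_size:
        # First-block address: redirect into this task's own block
        return task * block_size + addr
    # Everything else is shared and left alone in this model
    return addr


# With 32-register blocks, LR at $000 lands in a different place per task.
lr_locations = [remap(0x000, task, 32) for task in range(4)]
# Shared housekeeping code at 128 is the same address for every task.
shared = [remap(128, task, 32) for task in range(4)]
```

Under these assumptions every task can run identical code that refers to registers 0..31, which is the symmetry Chip is after for four C tasks.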

Yanomani · 2013-12-16 21:41

potatohead wrote: »

I'm kind of laughing.

Not in a bad way. It's just really different. I think we are going to have a fine time exploiting this thing, and I think it's going to take a while too. Once that is all done, we are going to know how to just nail P3.

I would not pull HUBEXEC at this point. It's mostly done, and I really do think the priority should be maximizing it on one task so that we get fast, larger programs from PASM and GCC. Anything beyond that is fun 'n games as far as I am concerned. And who knows? As it gets explored, we may well find highly optimized cases make a lot of sense.

In this, I agree with Chip. We know the one task case is needed, and it should perform well. Leaving it open lets us exploit the hardware, again probably learning just how a P3 will sing.

Hi potatohead

Like you, I've just been laughing here, in the same (harmless) way and for almost the same reason.

I'm wishing that those seeds get spread by the good winds of wisdom, like some perennial peanut, covering every space of this true coder's garden with their yellow flowers.

So please, everyone, try to help provide as many clones of ozpropdev et al. as possible, so as to distribute them and their healthy influence evenly among all of us!

Yanomani

K2 · 2013-12-17 12:59

potatohead wrote: »

I would not pull HUBEXEC at this point. It's mostly done, and I really do think the priority should be maximizing it on one task so that we get fast, larger programs from PASM and GCC. Anything beyond that is fun 'n games as far as I am concerned. And who knows? As it gets explored, we may well find highly optimized cases make a lot of sense.

It would be great to be so successful with the Propeller line that we (Parallax I mean) could afford product differentiation: One Propeller optimized for GCC and another for PASM, for example. Or a large Prop and another smaller and wickedly fast Prop.

For me, the C accommodations that are being made don't hold much appeal, but it's much too early to know exactly what we have in the P2. In a few months I could be singing a different tune entirely.

cgracey · 2013-12-17 13:30

K2 wrote: »

For me, the C accommodations that are being made don't hold much appeal, but it's much too early to know exactly what we have in the P2. In a few months I could be singing a different tune entirely.

These features are useful for PASM and other high-level languages, as well. It will make the Spin interpreter a lot faster, since I don't need to spool up snippets into cog RAM, but can execute them directly from the hub. It pretty much leaves 95% of the cog RAM free for PASM or variables. It might also allow Spin threads to be implemented way more easily.

David Betz · 2013-12-17 13:44

cgracey wrote: »

These features are useful for PASM and other high-level languages, as well. It will make the Spin interpreter a lot faster, since I don't need to spool up snippets into cog RAM, but can execute them directly from the hub. It pretty much leaves 95% of the cog RAM free for PASM or variables. It might also allow Spin threads to be implemented way more easily.

See, we C people do have good ideas occasionally! :-)

eldonb46 · 2013-12-17 16:04

cgracey wrote: »

It might also allow Spin threads to be implemented way more easily.

WOW, this is great news. I have been intently reading this and other threads for the last 12 months, looking for clues as to how all of these discussions affect P2 SPIN.

See my post and questions at: http://forums.parallax.com/showthread.php/141706-Propeller-II?p=1122712#post1122712 ,
and Chip's answer at: http://forums.parallax.com/showthread.php/144683-Propeller-II-programing-questions-to-Chip?p=1198995&viewfull=1#post1198995

Chip, I did not completely understand your response at the time, as I am not a PASM programmer, but I really like the idea of maybe having multiple SPIN tasks within a single SPIN Object. From my perspective, it would be useful to effectively have multiple Program Counters (SPIN execution pointers), all running at the same time over multiple coordinated methods. I have written several P1 projects where this ability would have been very useful.

Again, from my perspective, a SPIN Interpreter task should be initiated with keywords similar to "cognew", that is: "TASKNEW (SpinMethod, <(ParametersList)>, StackPointer)", and maybe a SETTMASK command to control scheduling.

This of course all assumes that SPIN Tasks could be implemented in such a way that a "waitcnt/waitpeq/waitpne/waitvid" will not stall any of the other three SPIN tasks within the object.

Chip, I am very glad that YOU are the author, and the person who heavily promotes SPIN - thanks.

--

K2 · 2013-12-17 16:10

...or I could be singing a different tune a lot sooner.

Thanks for putting this in perspective.

Really fast SPIN, or the ability to run eight simultaneous and substantial C instantiations would be outstanding.

cgracey · 2013-12-18 01:01

I wound up making all the 9-bit immediate branches into relative branches, instead of absolute. I could see as things were coming together with the new hub exec mode, that it's just going to have to be that way to keep things regular.

This affected (including their -D variants): JMPSW, IJZ, IJNZ, DJZ, DJNZ, JP, JNP, JZ, and JNZ.

Someone pointed out that this could make cog snippets relocatable, particularly when their variable registers are remapped, giving identical windows to several tasks or threads.

I really want to get this whole thing done so that I can start playing with it. It's going to be fun to build things out of.
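
A rough illustration of why relative branches make snippets relocatable: the encoded field depends only on the distance between branch and target, so it does not change when the whole snippet moves. The encoding here is an assumption (a signed 9-bit offset measured from the instruction after the branch), and the function names are hypothetical; the real P2 encoding may differ.

```python
def encode_relative(branch_addr, target_addr):
    """Return a 9-bit two's-complement offset field for a branch.

    Assumes the offset is measured from the instruction after the branch.
    """
    offset = target_addr - (branch_addr + 1)
    if not -256 <= offset <= 255:
        raise ValueError("target out of range for a 9-bit relative branch")
    return offset & 0x1FF


def decode_relative(branch_addr, field):
    """Recover the absolute target from a 9-bit relative field."""
    offset = field - 0x200 if field & 0x100 else field
    return branch_addr + 1 + offset


# A loop branch at snippet offset 5 that jumps back to snippet offset 2,
# with the snippet loaded at two different cog addresses.
field_at_010 = encode_relative(0x010 + 5, 0x010 + 2)
field_at_080 = encode_relative(0x080 + 5, 0x080 + 2)
same_bits = field_at_010 == field_at_080
```

With absolute 9-bit addresses the two fields would differ, and the snippet would need patching each time it was loaded at a new location.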

potatohead · 2013-12-18 01:13

I think so too, and relocatable COG code just isn't a bad thing. Yeah, you've built up a playground. Can't wait to see what we all end up doing with it.

Cluso99 · 2013-12-18 01:14

cgracey wrote: »

I wound up making all the 9-bit immediate branches into relative branches, instead of absolute. I could see as things were coming together with the new hub exec mode, that it's just going to have to be that way to keep things regular.

This affected (including their -D variants): JMPSW, IJZ, IJNZ, DJZ, DJNZ, JP, JNP, JZ, and JNZ.

Someone pointed out that this could make cog snippets relocatable, particularly when their variable registers are remapped, giving identical windows to several tasks or threads.

It certainly makes the coding more regular between hub and cog modes.

I really want to get this whole thing done so that I can start playing with it. It's going to be fun to build things out of.

You bet! We cannot wait either.

Bill Henning · 2013-12-18 05:39

Sounds good!

Time to set up my DE2-115 again... and a nano or two to try the fast serial

cgracey wrote: »

I wound up making all the 9-bit immediate branches into relative branches, instead of absolute. I could see as things were coming together with the new hub exec mode, that it's just going to have to be that way to keep things regular.

This affected (including their -D variants): JMPSW, IJZ, IJNZ, DJZ, DJNZ, JP, JNP, JZ, and JNZ.

Someone pointed out that this could make cog snippets relocatable, particularly when their variable registers are remapped, giving identical windows to several tasks or threads.

I really want to get this whole thing done so that I can start playing with it. It's going to be fun to build things out of.
