Hub Execution Model Thread (split from blog)

Bill Henning · 2013-12-09 14:26

Sounds like a nice and simple solution!

cgracey wrote: »

Putting the LR aside for a moment...

If we used both $1F0 and $1F1 for PC, but made writes to $1F0 cancel pipelined instructions and writes to $1F1 not cancel, we could get both normal and delayed branches!

Bill Henning · 2013-12-09 14:27

I really like putting the Spin interpreter into the top level object (i.e. the executable), as over time Spin will just keep getting better and faster as everyone learns more and more about how to best exploit P2's capabilities.

jmg wrote: »

Sounds like the classic Software moving target
- but as this is a Software issue, and Spin2 is not in ROM, this is (relatively) easy to address over time. Just allow it to occur.

Spin2 can potentially dynamically build (mentioned a while back & just software housekeeping) so it might even fold into that ?

Users would tell the Spin2 builder what thread-code to include, and it then knows what Spin2 variants to craft with.

cgracey · 2013-12-09 14:28

Bill Henning wrote: »

Sounds like a nice and simple solution!

They can be called PC and PCD.
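The difference between writing PC and writing PCD can be sketched with a toy model. This is only an illustration, not Chip's hardware: the `run` and `make_program` helpers and the tuple-based program format are hypothetical, and the depth of three in-flight instructions is an assumption taken from the "3 delay slots" mentioned later in the thread.

```python
# Toy model of the two proposed PC aliases (illustrative only).
IN_FLIGHT = 3  # ASSUMPTION: instructions already fetched behind a branch


def run(program, max_steps=20):
    """Execute (op, arg) tuples and return the names of the executed ops.

    ("op", name)     ordinary instruction, recorded in the trace
    ("pc", target)   write to PC ($1F0): in-flight instructions are cancelled
    ("pcd", target)  write to PCD ($1F1): in-flight instructions still run
    """
    trace = []
    pc = 0
    pending = None  # (remaining delay slots, target) for a delayed branch
    while pc < len(program) and len(trace) < max_steps:
        op, arg = program[pc]
        pc += 1
        if op == "pc":
            pc, pending = arg, None      # cancelling: go straight to target
            continue
        if op == "pcd":
            pending = (IN_FLIGHT, arg)   # delayed: branch after the slots
            continue
        trace.append(arg)
        if pending is not None:
            remaining, target = pending
            remaining -= 1
            if remaining == 0:
                pc, pending = target, None
            else:
                pending = (remaining, target)
    return trace


def make_program(branch_op):
    """Build a small program whose branch at index 1 targets index 6."""
    return [
        ("op", "a"),
        (branch_op, 6),
        ("op", "s1"),
        ("op", "s2"),
        ("op", "s3"),
        ("op", "skipped"),
        ("op", "target"),
    ]


if __name__ == "__main__":
    print(run(make_program("pc")))   # cancelling branch
    print(run(make_program("pcd")))  # delayed branch
```

In this model the cancelling branch goes straight from `a` to `target`, while the delayed branch also executes `s1`, `s2` and `s3` before arriving at `target`. Cycle counts are deliberately not modelled.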

David Betz · 2013-12-09 14:30

cgracey wrote: »

Putting the LR aside for a moment...

Oh well, I knew it was too good to be true. :-)

I guess we could go back to the SETLR instruction that selects one of the COG registers to be LR.

Sapieha · 2013-12-09 14:31

Hi Chip.

You mean HPC and HPCD ?

cgracey wrote: »

They can be called PC and PCD.

Bill Henning · 2013-12-09 14:45

I just had a thought.

Could we not use 'WC' on the branch instruction to specify cancelling pipelined instructions?

WC does not make sense for a jump... that way you can use $1F1 for LR

cgracey wrote: »

Putting the LR aside for a moment...

If we used both $1F0 and $1F1 for PC, but made writes to $1F0 cancel pipelined instructions and writes to $1F1 not cancel, we could get both normal and delayed branches!

cgracey · 2013-12-09 14:46

Sapieha wrote: »

Hi Chip.

You mean HPC and HPCD ?

No. It will be the actual PC for that task. The task PCs have all been expanded to 16 bits for hub use. When in cog mode, the upper 7 bits are ignored.
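As a rough sketch of what "the upper 7 bits are ignored" implies: a 16-bit PC with 7 bits dropped leaves a 9-bit cog address ($000..$1FF). The function name `effective_pc` and the `hub_mode` flag are hypothetical, introduced only for this illustration.

```python
PC_BITS = 16        # per-task PC width quoted in the post
COG_ADDR_BITS = 9   # 16 - 7 ignored bits = 9 bits, i.e. $000..$1FF


def effective_pc(pc, hub_mode):
    """Return the address a task would fetch from (hypothetical helper)."""
    pc &= (1 << PC_BITS) - 1
    if hub_mode:
        return pc                               # all 16 bits are used
    return pc & ((1 << COG_ADDR_BITS) - 1)      # upper 7 bits ignored


if __name__ == "__main__":
    print(hex(effective_pc(0x1234, hub_mode=True)))
    print(hex(effective_pc(0x1234, hub_mode=False)))
```

The sketch says nothing about how hub addresses are scaled or aligned; the thread does not cover that.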

cgracey · 2013-12-09 14:48

Bill Henning wrote: »

I just had a thought.

Could we not use 'WC' on the branch instruction to specify cancelling pipelined instructions?

WC does not make sense for a jump... that way you can use $1F1 for LR

If you look at the latest instruction encodings, every opcode that didn't need WZ/WC has had those bits repurposed to pack instructions tighter.

Bill Henning · 2013-12-09 15:40

Thanks, it was worth a try.

cgracey wrote: »

If you look at the latest instruction encodings, every opcode that didn't need WZ/WC has had those bits repurposed to pack instructions tighter.

rogloh · 2013-12-09 17:12

Probably it is just a personal taste thing but having two COG registers acting as the PC just seems a little weird to me. I would have much preferred one register being the PC (eg. $1F0) and the other being the proposed LR ($1F1), as it's just tidier. Maybe it has to be done like that to get the instruction count down but it still feels strange.

Bill Henning · 2013-12-09 17:17

Hi Chip,

cgracey wrote: »

That list would be good to see. It might expose something useful.

This is the list I extracted from the latest zip, I hope I got them all!

C.W. said there are 23 'D'elayed instructions... which ones did I miss?

ZCWS		1010100 ZC I CCCC DDDDDDDDD SSSSSSSSS		JMPRET	D,S/#		(set D to %1_1111_01xx for JMP/RET)
ZCWS		1010101 ZC I CCCC DDDDDDDDD SSSSSSSSS		JMPRETD	D,S/#		(set D to %1_1111_01xx for JMP/RET)

--MS		1010110 00 I CCCC DDDDDDDDD SSSSSSSSS		IJZ	D,S/#
--MS		1010110 01 I CCCC DDDDDDDDD SSSSSSSSS		IJZD	D,S/#   

--MS		1010110 10 I CCCC DDDDDDDDD SSSSSSSSS		IJNZ	D,S/#
--MS		1010110 11 I CCCC DDDDDDDDD SSSSSSSSS		IJNZD	D,S/#

--MS		1010111 00 I CCCC DDDDDDDDD SSSSSSSSS		DJZ	D,S/#
--MS		1010111 01 I CCCC DDDDDDDDD SSSSSSSSS		DJZD	D,S/#

--MS		1010111 10 I CCCC DDDDDDDDD SSSSSSSSS		DJNZ	D,S/#
--MS		1010111 11 I CCCC DDDDDDDDD SSSSSSSSS		DJNZD	D,S/#

--LS		1111010 0L I CCCC DDDDDDDDD SSSSSSSSS		JP	D/#,S/#
--LS		1111010 1L I CCCC DDDDDDDDD SSSSSSSSS		JPD	D/#,S/#

--LS		1111011 0L I CCCC DDDDDDDDD SSSSSSSSS		JNP	D/#,S/#
--LS		1111011 1L I CCCC DDDDDDDDD SSSSSSSSS		JNPD	D/#,S/#

--RS		1111100 00 I CCCC DDDDDDDDD SSSSSSSSS		JZ	D,S/#
--RS		1111100 01 I CCCC DDDDDDDDD SSSSSSSSS		JZD	D,S/#

--RS		1111100 10 I CCCC DDDDDDDDD SSSSSSSSS		JNZ	D,S/#
--RS		1111100 11 I CCCC DDDDDDDDD SSSSSSSSS		JNZD	D,S/#

--RS		1111101 00 I CCCC DDDDDDDDD SSSSSSSSS		JPOS	D,S/#
--RS		1111101 01 I CCCC DDDDDDDDD SSSSSSSSS		JPOSD	D,S/#

--RS		1111101 10 I CCCC DDDDDDDDD SSSSSSSSS		JNEG	D,S/#
--RS		1111101 11 I CCCC DDDDDDDDD SSSSSSSSS		JNEGD	D,S/#

--L-		1111111 xx L CCCC DDDDDDDDD x10010000		CALLA	D/#
--L-		1111111 xx L CCCC DDDDDDDDD x10010100		CALLAD	D/#

--L-		1111111 xx L CCCC DDDDDDDDD x10010001		CALLB	D/#
--L-		1111111 xx L CCCC DDDDDDDDD x10010101		CALLBD	D/#

--L-		1111111 xx L CCCC DDDDDDDDD x10010010		CALLAR	D/#
--L-		1111111 xx L CCCC DDDDDDDDD x10010110		CALLARD	D/#

--L-		1111111 xx L CCCC DDDDDDDDD x10010011		CALLBR	D/#
--L-		1111111 xx L CCCC DDDDDDDDD x10010111		CALLBRD	D/#

ZC--		1111111 ZC x CCCC xxxxxxxxx x11010000		RETA
ZC--		1111111 ZC x CCCC xxxxxxxxx x11010100		RETAD

ZC--		1111111 ZC x CCCC xxxxxxxxx x11010001		RETB
ZC--		1111111 ZC x CCCC xxxxxxxxx x11010101		RETBD

ZC--		1111111 ZC x CCCC xxxxxxxxx x11010010		RETAR
ZC--		1111111 ZC x CCCC xxxxxxxxx x11010110		RETARD

ZC--		1111111 ZC x CCCC xxxxxxxxx x11010011		RETBR
ZC--		1111111 ZC x CCCC xxxxxxxxx x11010111		RETBRD

I am only using about half of them right now, but unfortunately I can see uses for the others in hand crafted optimized code.

So far, whenever I was writing P2 assembly code, I could:

- usually use 3 delay slots (about 60% of the time)
- almost always use 2 delay slots (>90% of the remaining 40% not covered above)

So basically, 96%+ of the time my branches only took 1 cycle.

Mind you, it took a fair bit of re-factoring and re-organization to get that level of single-cycle branching usage, but 80%+ was easy to achieve.
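The 96% figure follows from the two percentages above. A quick check, taking 60% and 90% as the exact lower bounds quoted (the variable names are just for illustration):

```python
from fractions import Fraction

three_slots = Fraction(60, 100)        # branches where 3 delay slots were used
two_slots_of_rest = Fraction(90, 100)  # share of the remaining 40% using 2 slots

covered = three_slots + (1 - three_slots) * two_slots_of_rest

if __name__ == "__main__":
    print(f"{float(covered):.0%}")
```

That is 60% plus 90% of the remaining 40% (36%), which gives the 96% quoted.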

cgracey · 2013-12-09 18:02

rogloh wrote: »

Probably it is just a personal taste thing but having two COG registers acting as the PC just seems a little weird to me. I would have much preferred one register being the PC (eg. $1F0) and the other being the proposed LR ($1F1), as it's just tidier. Maybe it has to be done like that to get the instruction count down but it still feels strange.

It is a little weird, but there's no other way to handle both cancelling and non-cancelling branches that modify the PC directly.

cgracey · 2013-12-09 18:04

Bill Henning wrote: »

Hi Chip,

This is the list I extracted from the latest zip, I hope I got them all!

C.W. said there are 23 'D'elayed instructions... which ones did I miss?

ZCWS		1010100 ZC I CCCC DDDDDDDDD SSSSSSSSS		JMPRET	D,S/#		(set D to %1_1111_01xx for JMP/RET)
ZCWS		1010101 ZC I CCCC DDDDDDDDD SSSSSSSSS		JMPRETD	D,S/#		(set D to %1_1111_01xx for JMP/RET)

--MS		1010110 00 I CCCC DDDDDDDDD SSSSSSSSS		IJZ	D,S/#
--MS		1010110 01 I CCCC DDDDDDDDD SSSSSSSSS		IJZD	D,S/#   

--MS		1010110 10 I CCCC DDDDDDDDD SSSSSSSSS		IJNZ	D,S/#
--MS		1010110 11 I CCCC DDDDDDDDD SSSSSSSSS		IJNZD	D,S/#

--MS		1010111 00 I CCCC DDDDDDDDD SSSSSSSSS		DJZ	D,S/#
--MS		1010111 01 I CCCC DDDDDDDDD SSSSSSSSS		DJZD	D,S/#

--MS		1010111 10 I CCCC DDDDDDDDD SSSSSSSSS		DJNZ	D,S/#
--MS		1010111 11 I CCCC DDDDDDDDD SSSSSSSSS		DJNZD	D,S/#

--LS		1111010 0L I CCCC DDDDDDDDD SSSSSSSSS		JP	D/#,S/#
--LS		1111010 1L I CCCC DDDDDDDDD SSSSSSSSS		JPD	D/#,S/#

--LS		1111011 0L I CCCC DDDDDDDDD SSSSSSSSS		JNP	D/#,S/#
--LS		1111011 1L I CCCC DDDDDDDDD SSSSSSSSS		JNPD	D/#,S/#

--RS		1111100 00 I CCCC DDDDDDDDD SSSSSSSSS		JZ	D,S/#
--RS		1111100 01 I CCCC DDDDDDDDD SSSSSSSSS		JZD	D,S/#

--RS		1111100 10 I CCCC DDDDDDDDD SSSSSSSSS		JNZ	D,S/#
--RS		1111100 11 I CCCC DDDDDDDDD SSSSSSSSS		JNZD	D,S/#

--RS		1111101 00 I CCCC DDDDDDDDD SSSSSSSSS		JPOS	D,S/#
--RS		1111101 01 I CCCC DDDDDDDDD SSSSSSSSS		JPOSD	D,S/#

--RS		1111101 10 I CCCC DDDDDDDDD SSSSSSSSS		JNEG	D,S/#
--RS		1111101 11 I CCCC DDDDDDDDD SSSSSSSSS		JNEGD	D,S/#

--L-		1111111 xx L CCCC DDDDDDDDD x10010000		CALLA	D/#
--L-		1111111 xx L CCCC DDDDDDDDD x10010100		CALLAD	D/#

--L-		1111111 xx L CCCC DDDDDDDDD x10010001		CALLB	D/#
--L-		1111111 xx L CCCC DDDDDDDDD x10010101		CALLBD	D/#

--L-		1111111 xx L CCCC DDDDDDDDD x10010010		CALLAR	D/#
--L-		1111111 xx L CCCC DDDDDDDDD x10010110		CALLARD	D/#

--L-		1111111 xx L CCCC DDDDDDDDD x10010011		CALLBR	D/#
--L-		1111111 xx L CCCC DDDDDDDDD x10010111		CALLBRD	D/#

ZC--		1111111 ZC x CCCC xxxxxxxxx x11010000		RETA
ZC--		1111111 ZC x CCCC xxxxxxxxx x11010100		RETAD

ZC--		1111111 ZC x CCCC xxxxxxxxx x11010001		RETB
ZC--		1111111 ZC x CCCC xxxxxxxxx x11010101		RETBD

ZC--		1111111 ZC x CCCC xxxxxxxxx x11010010		RETAR
ZC--		1111111 ZC x CCCC xxxxxxxxx x11010110		RETARD

ZC--		1111111 ZC x CCCC xxxxxxxxx x11010011		RETBR
ZC--		1111111 ZC x CCCC xxxxxxxxx x11010111		RETBRD

I am only using about half of them right now, but unfortunately I can see uses for the others in hand crafted optimized code.

So far, whenever I was writing P2 assembly code, I could:

- usually use 3 delay slots (about 60% of the time)
- almost always use 2 delay slots (>90% of the remaining 40% not covered above)

So basically, 96%+ of the time my branches only took 1 cycle.

Mind you, it took a fair bit of re-factoring and re-organization to get that level of single-cycle branching usage, but 80%+ was easy to achieve.

Thanks for posting that list. I, too, think they'd all find use, so better to leave them alone.
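The CALLx/RETx rows of the quoted list share a visible pattern in the S field: the low two bits pick A, B, AR or BR, bit 2 marks the delayed form, and bit 6 separates RET from CALL. The decoder below is a sketch inferred only from those rows. The name `decode_call_ret` is hypothetical, and nothing here is checked against the full instruction set, so other instructions sharing the 1111111 opcode simply return `None`.

```python
def decode_call_ret(s_field):
    """Map a 9-bit S field from the list to a CALLx/RETx mnemonic, or None."""
    low8 = s_field & 0xFF  # bit 8 is shown as 'x' (don't care) in the list
    # Fixed bits seen in every CALLx/RETx row: 1?010???
    if (low8 & 0b10111000) != 0b10010000:
        return None
    base = "RET" if low8 & 0b01000000 else "CALL"
    port = ("A", "B", "AR", "BR")[low8 & 0b11]
    delayed = "D" if low8 & 0b100 else ""
    return base + port + delayed


if __name__ == "__main__":
    for s in (0b010010000, 0b010010100, 0b011010011, 0b011010111):
        print(f"{s:09b} -> {decode_call_ret(s)}")
```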

Bill Henning · 2013-12-09 18:13

Under the weird but possible category...

What about using writes to INA/INB/INC for the non-cancelling (delayed) version? I can't think of a good use case for needing to write to INA...

cgracey wrote: »

If you look at the latest instruction encodings, every opcode that didn't need WZ/WC has had those bits repurposed to pack instructions tighter.

Bill Henning · 2013-12-09 18:14

You are most welcome.

And thank you for keeping the delayed instructions around... much appreciated.

cgracey wrote: »

Thanks for posting that list. I, too, think they'd all find use, so better to leave them alone.

ersmith · 2013-12-09 18:20

David Betz wrote: »

As Bill has correctly pointed out, even the LR version of CALL is not absolutely necessary. It can be implemented by starting each C function with a pop of the return address off the AUX stack into an LR register. It would just be a little more space and time efficient to have the CALL_LR instruction or whatever it would be called.

Actually that depends on the AUX memory not being used for anything else (like video). It seems to me that the LR version is safer for that reason, although obviously having both would be beneficial depending on circumstances.

cgracey · 2013-12-09 18:21

Bill Henning wrote: »

Under the weird but possible category...

What about using writes to INA/INB/INC for the non-cancelling (delayed) version? I can't think of a good use case for needing to write to INA...

This might work, but I'd have to revisit how JMPRET/JMPRETD works, as they use INA..IND as dummy write targets.

Bill Henning · 2013-12-09 18:27

I just realized some things.

If:

- $1F0 is the PC, and writing to it means a cancelling jump
- $1F1 is PCD, the non-cancelling (delayed) version of PC
- $1EF was made LR (fixed at that address)

then:

- we can get rid of JMP and JMPD, replaced with

MOV PC,#addr and
MOV PCD,#addr

- CALL D/# would write the next cog address to LR instead of the jmpret at the end of the subroutine, then move D/# into PC
- CALLD D/# would write the next cog address to LR instead of the jmpret at the end of the subroutine, then move D/# into PCD

- RET would be replaced with MOV PC,LR
- RETD would be replaced with MOV PCD,LR

The stack versions of cog call/return need not be modified

- some more of the opcodes from the list above could probably be freed by versions of copying to PC or PCD

- we would automatically get 'D' versions of

ADD PC,#val
ADD PCD,#val

SUB PC,#val
SUB PCD,#val

I suspect it would free up some dual op opcodes.

I think this would make it worthwhile to have a permanent LR at $1EF.

edit: unfortunately this would complicate nested non-aux-stack in-cog subroutines. Argh!
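The nesting problem noted in the edit can be shown with a toy model of a single fixed LR. This is a sketch under the proposal's assumptions (CALL writes the next address to LR, RET is `MOV PC,LR`); the `Cog` class, the `nested` helper and the addresses are hypothetical.

```python
class Cog:
    """Toy cog with one fixed link register, as in the proposal above."""

    def __init__(self):
        self.pc = 0
        self.lr = None

    def call(self, target):
        self.lr = self.pc + 1  # CALL: next address goes to LR
        self.pc = target       # then the target goes to PC

    def ret(self):
        self.pc = self.lr      # RET expressed as MOV PC,LR


def nested(save_lr):
    """main (0) calls outer (10), which calls inner (20); return the final PC."""
    cog = Cog()
    cog.call(10)                       # LR = 1, the return address in main
    saved = cog.lr if save_lr else None
    cog.call(20)                       # LR = 11, overwriting main's address
    cog.ret()                          # inner returns to 11
    if save_lr:
        cog.lr = saved                 # outer restores its own return address
    cog.ret()                          # outer returns
    return cog.pc


if __name__ == "__main__":
    print(nested(save_lr=False))  # outer never gets back to main
    print(nested(save_lr=True))   # outer returns to address 1
```

Without the save, the second call overwrites LR and the outer routine returns to itself instead of to main. Any non-leaf routine would need to copy LR somewhere before calling, which is the complication for nested in-cog subroutines.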

David Betz · 2013-12-09 18:55

ersmith wrote: »

Actually that depends on the AUX memory not being used for anything else (like video). It seems to me that the LR version is safer for that reason, although obviously having both would be beneficial depending on circumstances.

True. I tried to make that point as well. It seems a waste to use the AUX stack to hold one value temporarily.

If there isn't a fixed location for LR then we can go back to my original proposal where we add a "SETLR" instruction to set an internal register that remembers which COG address to use as LR. Then the CALL_LR instruction will just use the location indicated by that hidden internal register. I suppose it would be nice to be able to read that register as well although I guess that's not absolutely necessary. It's interesting that two PCs are wanted now to support a delayed branch when just a little while ago there was a proposal to remove all of the delayed jumps.

jazzed · 2013-12-09 19:01

Chip,

Do you envision a COG running a HUB executable program in parallel with other threaded COG code?

I suppose it's not optimal for HUB execute, but it could provide that little extra so the COG can be more fully used.

It's probably a PITA to make work. How would you launch such a COG anyway?

rogloh · 2013-12-09 19:19

jazzed wrote: »

Chip,

Do you envision a COG running a HUB executable program in parallel with other threaded COG code?

I see a good use case for this feature. One could have a micro scheduler task running in COG mode that would be scheduled to run at the lowest frequency (1 in 16 cycles I think). It could then spend most of its time waiting for some elapsed time or condition and stop/reschedule the main hub exec task as required for it to run atomically. Think preemptively (or even co-operatively) tasked hub exec code using hub based memory as locks. This could be very useful indeed. You could even write a mini RTOS this way that all ran in one COG. It would take some COG resources but could be written to run in its own VM to save space. It could also be used as a debugger for example.

David Betz · 2013-12-09 19:25

jazzed wrote: »

Chip,

Do you envision a COG running a HUB executable program in parallel with other threaded COG code?

I suppose it's not optimal for HUB execute, but it could provide that little extra so the COG can be more fully used.

It's probably a PITA to make work. How would you launch such a COG anyway?

I've tried to argue in favor of this but it may be too difficult to implement.

cgracey · 2013-12-09 21:07

jazzed wrote: »

Chip,

Do you envision a COG running a HUB executable program in parallel with other threaded COG code?

I suppose it's not optimal for HUB execute, but it could provide that little extra so the COG can be more fully used.

It's probably a PITA to make work. How would you launch such a COG anyway?

It's actually simplest to not make any task special. It's easier for all tasks to have hub and cog capability than it is for just one. Whatever I do must apply to all tasks. That means all tasks could be in hub mode at the same time, or some could be in hub mode and others could be in cog mode.

David Betz · 2013-12-09 21:23

Chip: Any chance the LR idea will make it into P2? Or has PC/PCD replaced it?

cgracey · 2013-12-09 21:44

David Betz wrote: »

Chip: Any chance the LR idea will make it into P2? Or has PC/PCD replaced it?

It is very possible that it will go in, somehow.

David Betz · 2013-12-09 21:46

cgracey wrote: »

It is very possible that it will go in, somehow.

I really liked the idea of using $1f1 but I understand why it isn't possible. The two PCs are very useful as well.

cgracey · 2013-12-09 22:54

David Betz wrote: »

I really liked the idea of using $1f1 but I understand why it isn't possible. The two PCs are very useful as well.

I spent all day getting the PC mapped into $1F0/$1F1, only to realize that it extended the critical path of the whole chip and slowed it way down. The reason is that the computed ALU result is the last-arriving signal set, and running it through a few more sets of muxes to accommodate the four task PCs, and then getting it out to the cog RAM instruction address input, just takes too long. The only way to circumvent these delays is to add another pipeline stage, which will make cancelling branches take one more clock, and 4-way multitasking branches take two clocks, instead of one. It's not worth it. So, the PCs will have to be addressed by instructions only, in which the PC result does not go through the main ALU. It was worth trying, though, because the benefits would have been great. I think to compensate, I'll make relative jumps, which are easy to implement without drawbacks. This will give us the same performance we would have had with mappable PCs, when it comes to adding to them.

evanh · 2013-12-10 03:37

I'd like to make an observation. The four hardware threads in each Cog are not very efficient for anything other than hand coded soft-peripherals. They are preset time-sliced and therefore of no benefit to normally prioritised multitasking.

Adding extra hardware, to support them, and particularly removing useful instructions to make these threads function in hubexec mode would not be a good idea imho.

Best just to leave the hardware thread slicing where it was intended to be for now. Maybe look into extending it into HEM for the P3.

cgracey · 2013-12-10 03:41

evanh wrote: »

I'd like to make an observation. The four hardware threads in each Cog are not very efficient for anything other than hand coded soft-peripherals. They are preset time-sliced and therefore of no benefit to normally prioritised multitasking.

Adding extra hardware, to support them, and particularly removing useful instructions to make these threads function in hubexec mode would not be a good idea imho.

Best just to leave the hardware thread slicing where it was intended to be for now. Maybe look into extending it into HEM for the P3.

None of that is going away. It's just that now any task will be able to execute from hub.
