Propeller II update - BLOG

jmg · 2014-03-07 20:29

Bill Henning wrote: »

TRESUME does increment the PC, in the currently discussed version

It would be possible to instead have TPAUSE taskreg,#code exit its internal looping when taskreg is 0 (after it has written the non-zero code), then TRESUME would not need to increment the PC. If this was done, it might be possible to get rid of the TRESUME instruction?

Yes, that gives a natural release mechanism, that is within existing silicon behaviour.

The TPAUSE then needs to behave a bit like two opcodes in one:
on the first clock it writes a non-zero value to TestLoc, and then on all following clocks it loops reading TestLoc.
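
A rough sketch of that write-then-poll behaviour, as a toy Python model rather than real P2 code. The class, the register dictionary and the exit-on-zero rule are assumptions taken from the variant being discussed here, not from the FPGA image:

```python
class PausedTask:
    """Toy model of one task executing the proposed TPAUSE (names are illustrative)."""

    def __init__(self, regs, testloc, code):
        self.regs = regs            # shared cog registers, modelled as a dict
        self.testloc = testloc      # register named in the TPAUSE
        self.code = code            # immediate written on the first clock
        self.pc = 0                 # 0 = sitting on TPAUSE, 1 = next instruction
        self.first_clock = True

    def clock(self):
        """Run one clock of this task; return True once it is past the TPAUSE."""
        if self.pc != 0:
            return True
        if self.first_clock:
            self.regs[self.testloc] = self.code   # phase 1: write the code
            self.first_clock = False
            return False
        if self.regs[self.testloc] != 0:          # phase 2: poll, PC stays put
            return False
        self.pc = 1                               # register was cleared: move on
        return True


regs = {"TestLoc": 0}
task = PausedTask(regs, "TestLoc", code=5)
task.clock()                  # writes 5 and starts looping
task.clock()                  # still paused
regs["TestLoc"] = 0           # another task (debugger/scheduler) releases it
print(task.clock(), task.pc)
```

In this model the only extra state is the "first clock" flag, which is exactly the piece of per-task state the hardware would have to track.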

Seems such an opcode would support breakpoint handling, and may have more general inter-task uses ?

Yes, this lowers the need for TRESUME.

Cluso99 · 2014-03-07 20:51

I was thinking for the debugging where you utilised TPAUSE as a breakpoint. After the debugger has done whatever is required, it will want to allow the traced task to resume. There are two cases for resuming...
1. The breakpoint remains to pause the next time this instruction has been hit
2. The breakpoint is removed (maybe another has been added somewhere else)

In (1) the old instruction must be re-inserted in place of the TPAUSE, the traced task is then permitted to run 1 instruction, and then the TPAUSE is reinserted, and finally the program is permitted to continue.

In (2) the old instruction must be re-inserted in place of the TPAUSE, and the traced task is then permitted to continue.

So, in (1) the first TRESUME would not inc the PC, and would permit 1 instruction to execute. The next TRESUME would inc the PC.

And, in (2) the TRESUME would not inc the PC.

Hence why I thought the TRESUME would be excellent if it could optionally inc the PC or not inc the PC.
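
To make the two cases concrete, here is a small Python model of the debugger side. It is only a sketch: the program is a list of strings, the code is assumed to be straight-line (no jumps), and `resume` is a made-up helper, not a P2 instruction:

```python
TPAUSE = "TPAUSE"


def resume(program, pc, saved, keep_breakpoint):
    """Resume a task paused at program[pc], which currently holds TPAUSE.

    saved:           the original instruction the breakpoint replaced
    keep_breakpoint: True for case (1), False for case (2)
    Returns (single_stepped, pc): the instructions the debugger single-stepped
    and the PC at which the task is then allowed to run freely.
    """
    assert program[pc] == TPAUSE
    program[pc] = saved                 # both cases: put the old instruction back
    single_stepped = []
    if keep_breakpoint:
        single_stepped.append(program[pc])   # run exactly one instruction
        program[pc] = TPAUSE                 # re-arm the breakpoint
        pc += 1                              # carry on after the breakpoint
    return single_stepped, pc


prog = ["mov a,#1", TPAUSE, "add a,#2"]
print(resume(prog, 1, "add a,b", keep_breakpoint=True), prog)
```

In case (2) the PC is left alone and the restored instruction simply executes when the task continues.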

jmg · 2014-03-07 21:05

Cluso99 wrote: »

So, in (1) the first TRESUME would not inc the PC, and would permit 1 instruction to execute. The next TRESUME would inc the PC.

I agree two handling cases are needed, but the suggestion Bill made, I believe supports two handling cases without needing an explicit TRESUME opcode.

The change in register value triggers resume in one case, and in the swapped-opcode case, the PC does execute-then-advance.

An additional appeal of this approach is that a task totally manages its own PC (i.e. no patches and no surprises).

cgracey · 2014-03-07 21:41

This idea of making TPAUSE interactive is really interesting. It would save a bunch of steps to get it running again. Also, as long as it's looping, nothing is changing, so perhaps some state data can be read stably. This would mean breakpoints for every task, while task 3 would still have that extra circuitry to facilitate preemptive threading.

jmg · 2014-03-07 21:53

cgracey wrote: »

This idea of making TPAUSE interactive is really interesting. It would save a bunch of steps to get it running again. Also, as long as it's looping, nothing is changing, so perhaps some state data can be read stably. This would mean breakpoints for every task, while task 3 would still have that extra circuitry to facilitate preemptive threading.

Yes, I like the write-then-poll interactive model. It is one opcode and atomic.

I'm just now wondering if the use can be made even more general with some choices on exit case.

Test of Zero seems a little blunt, and it may be useful to pass more information on resume.
(and Zero test kills that option)

Maybe the # part of the param, could specify choices of what is tested, or would wait-on-Bit31 operation allow all other bits to be used as params in both directions ? ( wait only really needs a single bit, not all 32 )

In the wait-choice case, 2 bits of the TPAUSE # Param could say wait on one of (eg) Bits 31.30.29.28, and now a master task can release any mix of 3 waiting tasks, in a single line.
Bits 0..27 are not tested for release and can pass other information
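
A minimal Python sketch of that bit layout. The polarity (a task waits while its bit is set and is released when the bit is cleared) and the idea of several tasks watching one shared register are assumptions made for the example; nothing here is settled hardware behaviour:

```python
PAYLOAD_MASK = (1 << 28) - 1          # bits 0..27 carry data, never tested


def wait_bit(selector):
    """Map the 2-bit selector (0..3) to one of bits 31, 30, 29, 28."""
    return 31 - selector


def is_released(reg, selector):
    """Assumed polarity: the task keeps looping while its wait bit is set."""
    return (reg >> wait_bit(selector)) & 1 == 0


def release_word(current, selectors, payload):
    """Build the single value the master writes: clear the chosen wait bits
    and put the payload in bits 0..27, leaving the other wait bits alone."""
    word = (current & ~PAYLOAD_MASK) | (payload & PAYLOAD_MASK)
    for s in selectors:
        word &= ~(1 << wait_bit(s))
    return word & 0xFFFFFFFF


reg = 0xF0000000                      # all four wait bits set
reg = release_word(reg, selectors=[0, 2], payload=0x123)
print(hex(reg), [is_released(reg, s) for s in range(4)])
```

The master releases two of the waiting tasks and hands over a payload with one register write.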

cgracey · 2014-03-08 03:14

whicker wrote: »

pedward,

Except that TPAUSE doesn't exactly yield.

When someone yields the podium, they don't end up standing there with a fixed expression until someone else picks them up and forcibly takes them down from the stage.

At the roundabout, TPAUSE would be for the car to continue circling around until another car bumped it into the grass area in middle.

These are really funny analogies, and quite apt.

I think I have everything done for multi-threading now. I just need to test it all out.

At this point, I've decided not to make the TPAUSE instruction do anything other than write a value to some register and loop to itself. I wrote code to test it with TRESUME and it works very nicely, as is, with very little management code. If I were to make TPAUSE complicated, so that it knows if it's running for the first time or not, it would require new state information that would need to be tracked. It's elegant the way it is.

Ariba · 2014-03-08 04:25

This is how PASD has handled breakpoints on the Propeller 1 for 7 years.

...
    jmpret pc_reg,#breakp    'breakpoint
    ...


breakp
    pushzc flags
    <send  pc and flags to Debugger>
    <get original instr at breakpoint from Debugger>
    <wait for continue command from Debugger>
    popzc flags
    <execute original instr here>
    jmp pc_reg      'resume

A breakpoint just jumps to a breakpoint-handler that saves the flags and communicates with the Debugger (with the help of another cog).
On continue it executes the original instruction that was at the breakpoint location and continues behind the breakpoint.

The breakpoint handler also allows reading and writing any cog and hub location, so the debugger has full control over the memories.
On the Prop 1 we needed a little Debug-Kernel compiled in the code to test (12 longs) and a second cog with the communication code, on the Prop2 the handler can be in HubRAM and we need no other cog, only a few cog registers.

The breakpoint handler is executed in the same task that gets debugged, so it should be possible to show the states of the registers in all threads (except INDx, which are not readable).

Andy

ErNa · 2014-03-08 04:26

I feel like I'm taking part in the Olympic Games as a spectator! Just keep going, I cannot follow ;-) When some people sat together two years ago at UPEW discussing what could be done .... no one could imagine what can be done! Great! ErNa

Cluso99 · 2014-03-08 04:54

Andy,
The problem with P2 is that not all instructions can be executed from a different address. For example, any of the relative jump/call instructions, any instruction following augs/d.

So we have to at least put relative jump and calls back into the break instruction. I am not sure if the TSAVE saves any AUGS/D already in play.

cgracey · 2014-03-08 05:30

Cluso99 wrote: »

Andy,
The problem with P2 is that not all instructions can be executed from a different address. For example, any of the relative jump/call instructions, any instruction following augs/d.

So we have to at least put relative jump and calls back into the break instruction. I am not sure if the TSAVE saves any AUGS/D already in play.

TSAVE does save those things, along with a lot of other stuff, including INDA, INDB, PTRA, PTRB, PTRX, PTRY, TLOCK pending, delayed jump status, etc. It all comes to 243 bits, not including the WIDEs and the task's LIFO which must be saved separately. My next job is to write a thread switcher with 16 independent programs running, so that this preemptive thing is proven.

mindrobots · 2014-03-08 06:04

cgracey wrote: »

I think I have everything done for multi-threading now. I just need to test it all out.

I love waking up to the smell of fresh sausage!!

We'll be able to taste test soon!!

Bill Henning · 2014-03-08 08:25

I strongly prefer test for zero, and here is why:

- using the 9 bit constant allows us to encode 511 states, without needing an AUGS or specifying a register (that would have to be loaded)

- using a single bit of the nine for the exit from loop condition would limit us to 255 states for breakpoints, wait-for-event-X, and system calls. VERY limiting.

- using 0 makes it very easy for the debugger/scheduler (tjnz)

- KISS principle. 0 is VERY simple, easy to explain and use

- more complex cases can easily be handled, by assigning one of the 511 codes to mean "look at this other register, that is bit-defined"

Whenever I make a suggestion, I try to minimize the hardware required to implement it, and try to push off complexity to software.

jmg wrote: »

Yes, I like the write-then-poll interactive model. It is one opcode and atomic.

I'm just now wondering if the use can be made even more general with some choices on exit case.

Test of Zero seems a little blunt, and it may be useful to pass more information on resume.
(and Zero test kills that option)

Maybe the # part of the param, could specify choices of what is tested, or would wait-on-Bit31 operation allow all other bits to be used as params in both directions ? ( wait only really needs a single bit, not all 32 )

In the wait-choice case, 2 bits of the TPAUSE # Param could say wait on one of (eg) Bits 31.30.29.28, and now a master task can release any mix of 3 waiting tasks, in a single line.
Bits 0..27 are not tested for release and can pass other information

jmg · 2014-03-08 10:52

Bill Henning wrote: »

I strongly prefer test for zero, and here is why:

- using the 9 bit constant allows us to encode 511 states, without needing an AUGS or specifying a register (that would have to be loaded)

- using a single bit of the nine for the exit from loop condition would limit us to 255 states for breakpoints, wait-for-event-X, and system calls. VERY limiting.

- using 0 makes it very easy for the debugger/scheduler (tjnz)

- KISS principle. 0 is VERY simple, easy to explain and use

- more complex cases can easily be handled, by assigning one of the 511 codes to mean "look at this other register, that is bit-defined"

Whenever I make a suggestion, I try to minimize the hardware required to implement it, and try to push off complexity to software.

You skipped over the fact that zero cannot signal anything at all back to the restarting thread(s)?

To me that lack of symmetry in control is rather more VERY limiting than a move from 512 to 255 states the other way.

I'm interested in real use examples where having 255 states is drop-dead, and 512 is magic?

Pgmrs are quite used to waiting on booleans now, so I don't quite buy KISS here, and in hardware it is actually simpler to wait on a single bit than to compare 8 bits, so if you really want to push minimal hardware, a bit wins there.

Bill Henning · 2014-03-08 11:08

jmg,

with zero, scheduler/debugger can do this:

[code]
checktasks tjnz #task1service
tjnz #task2service
tjnz #task3service
[/code]

To pass something back

task1service long 0
task2service long 1
task3service long 2

task1result long 0
task2result long 0
task3result long 0

Regarding 255 vs 511 signal states:

- more breakpoints
- more wait cases
- more system calls

Frankly, my solution is far simpler, allows more state to be passed back, and simpler to implement in hardware.

Please show a real example where what you are proposing is better / takes less code / can do more.

jmg wrote: »

You skipped over the fact that zero cannot signal anything at all back to the restarting thread(s)?

To me that lack of symmetry in control is rather more VERY limiting than a move from 512 to 255 states the other way.

I'm interested in real use examples where having 255 states is drop-dead, and 512 is magic?

Pgmrs are quite used to waiting on booleans now, so I don't quite buy KISS here, and in hardware it is actually simpler to wait on a single bit than to compare 8 bits, so if you really want to push minimal hardware, a bit wins there.

jmg · 2014-03-08 11:56

Bill Henning wrote: »
with zero, scheduler/debugger can do this:
checktasks  tjnz  #task1service
            tjnz  #task2service
            tjnz  #task3service
To pass something back

task1service long 0
task2service long 1
task3service long 2

task1result long 0
task2result long 0
task3result long 0

I'm not following here. This seems to be a polling loop in the Master, but the slave, in TPAUSE, cannot change anything; it is passively waiting for the master to release it?

ie My understanding is
TPAUSE taskPollreg,#SignalsValue
This writes SignalsValue to taskPollreg, then loops-to-self, until the Master/Scheduler releases it.
The master can check which of presumably many Pauses the slave is in, by reading that value.

In the hypothetical variant case we are discussing (not the FPGA), that TPAUSE loop is a testing-loop-to-self (in current FPGA an explicit separate TRESUME opcode is needed here)

To get a task to move-on, the scheduler writes to change the value in taskPollreg, which the slave can test in later opcodes.
(I suggest writing bit+WhatToDo, you suggest 00H - minor variants on the release detail )

In my case, TPAUSE is a Pause-and-Wait-for-instructions opcode, that passes information both ways across threads in an atomic manner.
If you pass back only 00, you then need to consume another memory for the WhatToDo value, plus another code line to write to that in every release instance.

Given the Pause/Continue signal is purely boolean in nature, it is the most naturally data efficient to use a boolean-test-wait here, which frees other bits for useful work.

The scheduler can then atomically both release and instruct the task what to do next, in one line of code.

If the master has few slots and the slave has many, that Atomic matters - two line releases mean order matters, and is slower.
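
Why the write order matters for a two-write release can be shown with a small Python model. This is only a sketch: `req` and `result` stand for the request and result registers, and "the slave wakes right after the first write" is the worst case assumed for illustration:

```python
def release_two_writes(regs, result, clear_first):
    """Release a paused task with two separate writes.

    Returns the result value the slave would read if it woke up immediately
    after the master's FIRST write, or None if it is still paused then.
    """
    if clear_first:
        steps = [("req", 0), ("result", result)]
    else:
        steps = [("result", result), ("req", 0)]

    name, value = steps[0]
    regs[name] = value
    seen = regs["result"] if regs["req"] == 0 else None   # slave polls req

    name, value = steps[1]
    regs[name] = value
    return seen


stale = 0xDEAD
print(release_two_writes({"req": 7, "result": stale}, 42, clear_first=True))
print(release_two_writes({"req": 7, "result": stale}, 42, clear_first=False))
```

Clearing the request first lets the slave see the stale result; writing the result first keeps it paused until the data is in place. A single combined write avoids having to think about the order at all.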

ctwardell · 2014-03-08 12:20

Bill,

Since the scheduler is watching the taskPollreg for non-zero, how would it differentiate between a non-zero written by a TPAUSE and a non-zero written by itself in response to a TPAUSE?

Based on what it looks like you want to achieve I prefer jmg's approach of using a single bit to indicate the pause request and the remainder of the bits for signaling data.

C.W.

Bill Henning · 2014-03-08 12:30

jmg wrote: »

I'm not following here. This seems to be a polling loop in the Master, but the slave, in TPAUSE, cannot change anything; it is passively waiting for the master to release it?

ie My understanding is
TPAUSE taskPollreg,#SignalsValue
This writes SignalsValue to taskPollreg, then loops-to-self, until the Master/Scheduler releases it.
The master can check which of presumably many Pauses the slave is in, by reading that value.

In the hypothetical variant case we are discussing (not the FPGA), that TPAUSE loop is a testing-loop-to-self (in current FPGA an explicit separate TRESUME opcode is needed here)

Exactly!

TPAUSE reg,#code

Just writes code to reg, and loops in place.

The first version I suggested would loop forever, counting on the debugger or scheduler to step past it (for the sake of simplicity).

During our discussion, I realized that if the loop was essentially TJNZ for the client, then it would auto-release when the debugger or scheduler cleared 'reg' - in which case, there is no real need for TRESUME.

jmg wrote: »

To get a task to move-on, the scheduler writes to change the value in taskPollreg, which the slave can test in later opcodes.
(I suggest writing bit+WhatToDo, you suggest 00H - minor variants on the release detail )

In my original suggestion, the loop would be released by the debugger/scheduler modifying the PC to get past it, the TPAUSE was an infinite loop.

jmg wrote: »

In my case, TPAUSE is a Pause-and-Wait-for-instructions opcode, that passes information both ways across threads in an atomic manner.
If you pass back only 00, you then need to consume another memory for the WhatToDo value, plus another code line to write to that in every release instance.

Given the Pause/Continue signal is purely boolean in nature, it is the most naturally data efficient to use a boolean-test-wait here, which frees other bits for useful work.

The scheduler can then atomically both release and instruct the task what to do next, in one line of code.

If the master has few slots and the slave has many, that Atomic matters - two line releases mean order matters, and is slower.

In my earlier response, the debugger/scheduler can pass data back in taskXresult, and it is still effectively atomic as the TPAUSE'd task would not be able to access it until after it was released.

client task: (my way)

TYIELD task1request, #SOMEVALUE
next task 1 instruction

debugger/scheduler:

TJNZ #task1handler
TJNZ #task2handler
TJNZ #task3handler

...

task1handler: (my way)
<do something>
mov task1result,#n
mov task1request,#0

With what you propose:

client task: (your way)

TYIELD task1request, #SOMEVALUE | requestbit
next task 1 instruction

debugger/scheduler:

      shr task1request,#bitflag wc nr
if_c jmp #task1handler
      shr task2request,#bitflag wc nr
if_c jmp #task2handler
      shr task3request,#bitflag wc nr
if_c jmp #task3handler

task1handler: 
<do something>
mov task1request,#n ' without request bit

I think my way is far simpler, easier to read, provides more signal values, and needs less logic to implement.

Bill Henning · 2014-03-08 12:34

It is not an issue as:

- scheduler/debugger only takes an action on a non-zero value
- when it has finished with the action, it writes it to zero, which releases the TPAUSE, and the scheduler now ignores it until the next request written there
- alternatively, if the scheduler takes the task's cycles away, it can resume at the instruction following the TPAUSE (original proposal)

There is no need for it to distinguish who wrote a zero at all, as the zero cannot cause an action by the scheduler.

Also, doing TPAUSE task1req,#0 is effectively a NOP, as it does not cause the debugger/scheduler to do any action (TJNZ)

ctwardell wrote: »

Bill,

Since the scheduler is watching the taskPollreg for non-zero, how would it differentiate between a non-zero written by a TPAUSE and a non-zero written by itself in response to a TPAUSE?

Based on what it looks like you want to achieve I prefer jmg's approach of using a single bit to indicate the pause request and the remainder of the bits for signaling data.

C.W.

ctwardell · 2014-03-08 12:42

Bill Henning wrote: »

It is not an issue as:

- scheduler/debugger only takes an action on a non-zero value
- when it has finished with the action, it writes it to zero, which releases the TPAUSE, and the scheduler now ignores it until the next request written there
- alternatively, if the scheduler takes the task's cycles away, it can resume at the instruction following the TPAUSE (original proposal)

There is no need for it to distinguish who wrote a zero at all, as the zero cannot cause an action by the scheduler.

Also, doing TPAUSE task1req,#0 is effectively a NOP, as it does not cause the debugger/scheduler to do any action (TJNZ)

OK, I see your method uses a different register to pass data back.

C.W.

jmg · 2014-03-08 12:42

@ Bill
I am still not following your master loop, what does TJNZ test, and where do I find TJNZ in the P2 docs ?
It still seems to be waiting on the slave to change something, but the slave is paused... ?

I still see no example of the master passing information to the slave, which to me is at least equally common as slave-> master

Bill Henning · 2014-03-08 12:58

Hi jmg,

jmg wrote: »

@ Bill
I am still not following your master loop, what does TJNZ test, and where do I find TJNZ in the P2 docs ?
It still seems to be waiting on the slave to change something, but the slave is paused... ?

TJNZ reg,#addr ' test reg, and if it is non-zero, jump to addr

Umm... TJNZ is available on the P1, I've used it a bunch of times.

The debugger/scheduler will have something like this:

' master scheduler loop, waiting for event - we will ignore pre-empting based on a timer for now
' this is a bare-bones service provider that can serve as a skeleton for a debugger or scheduler

scheduler
       tjnz   task1req, #task1handler
       tjnz   task2req, #task2handler
       tjnz   task3req, #task3handler
       jmp  #scheduler

task1handler
      ' decode the request, and handle it
      mov   task1result,result   ' debugger can pass back results to task1  - if task needs a result
      mov   task1req,#0           ' release task if PC not incremented past TPAUSE
      jmp   #scheduler

task2handler
      ' decode the request, and handle it
      mov   task2result,result   ' optionally pass back result
      mov   task2req,#0           ' release task if PC not incremented past TPAUSE
      jmp   #scheduler

task3handler
      ' decode the request, and handle it
      mov   task3result,result   ' optionally pass back result
      mov   task3req,#0           ' release task if PC not incremented past TPAUSE
      jmp   #scheduler

task1req     long  0    ' task 1 places requests here
task1result  long  0    ' debugger can pass back results to task 1 - if task needs a result

task2req     long  0    ' task 2 places requests here
task2result  long  0    ' debugger can pass back results to task 2 - if task needs a result

task3req     long  0    ' task 3 places requests here
task3result  long  0    ' debugger can pass back results to task 3 - if task needs a result

jmg wrote: »

I still see no example of the master passing information to the slave, which to me is at least equally common as slave-> master

' example of slave invoking a breakpoint, requesting suspension until a signal, or calling system function

         TPAUSE     task1req,#BREAKPOINT12      ' hit breakpoint 12

...

       TPAUSE    task1req,#getch
       MOV          char, task1result
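
The same request/result handshake can be tried out as a Python model. This is a sketch only: the request codes, the handler table and `scheduler_pass` are invented for the example, and one pass services every pending task in order instead of jumping back to the top after each handler as the PASM loop does:

```python
GETCH = 2             # hypothetical request codes, not defined in this thread
BREAKPOINT12 = 12


def scheduler_pass(regs, handlers):
    """One pass over the request registers, like the tjnz chain above.

    Returns the list of task numbers that were serviced and released.
    """
    serviced = []
    for task in (1, 2, 3):
        req = regs[f"task{task}req"]
        if req != 0:                                   # tjnz taskNreq,#taskNhandler
            # Write the result first, then clear the request to release the task.
            regs[f"task{task}result"] = handlers[req](task)
            regs[f"task{task}req"] = 0
            serviced.append(task)
    return serviced


regs = {f"task{n}{kind}": 0 for n in (1, 2, 3) for kind in ("req", "result")}
handlers = {
    GETCH: lambda task: ord("A"),       # pretend a character arrived
    BREAKPOINT12: lambda task: 0,       # nothing to hand back
}

regs["task1req"] = GETCH                # models: TPAUSE task1req,#getch
print(scheduler_pass(regs, handlers), chr(regs["task1result"]))
```

A request register holding zero is skipped, so the scheduler ignores a task until it writes a new non-zero code.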

ctwardell · 2014-03-08 12:58

jmg wrote: »

and where do I find TJNZ in the P2 docs ?

It looks like it has been renamed to JNZ from TJNZ which was used on the P1.

C.W..

Bill Henning · 2014-03-08 13:03

Thanks, I must have missed the re-naming

ctwardell wrote: »

It looks like it has been renamed to JNZ from TJNZ which was used on the P1.

C.W..

ctwardell · 2014-03-08 13:12

Bill Henning wrote: »

Thanks, I must have missed the re-naming

I think this was the result of the remapping that Chip did a few months ago.

The new encoding no longer includes the ability to do a WZ, so in effect it isn't doing a 'test' that can later be checked by looking at the Z flag.

C.W.

potatohead · 2014-03-08 13:15

That's exactly it as I understand it too.

jmg · 2014-03-08 13:16

Bill Henning wrote: »

TJNZ reg,#addr ' test reg, and if it is non-zero, jump to addr

Ahh, the extra param and alias opcode help make the code easier to follow...

Bill Henning · 2014-03-08 13:39

As soon as you asked about TJNZ I figured out why you misunderstood why I need 0 - it makes all the code very clean and easy.

jmg wrote: »

Ahh, the extra param and alas opcode helps make the code easier to follow...

potatohead · 2014-03-08 13:57

Due to how the P1 was built, we are all also adept at looking for these kinds of cases. 'tis a sweet spot.

Bill, this last round of discussions, needed to whittle the whole thing down to that core bit of silicon really turned out rather nice. Thanks for the clear examples.

I'm feeling good about the whole thing right now. Very useful at a moderate complexity cost, and not a whole lot to look out for.

Something I realized on this one is the input to the sausage machine was significantly more than it has been for other functionality, save the HUBEX and caching. Worth it again, IMHO, but I really do want to see us get to the SERDES/USB and close it. By the end of that discussion, which will be significant, we will have been jamming on the image which contains these goodies, now clear and simple enough for mere mortals like me to go off and use.

Of course, Chip's ability to formulate said sausage makes it all work in that "Propeller way", and I think that goes without saying, but I am saying it anyway.

jmg · 2014-03-08 13:58

Bill Henning wrote: »

As soon as you asked about TJNZ I figured out why you misunderstood why I need 0 - it makes all the code very clean and easy.

I still do not like the inefficiency of using 32 bits as a boolean.., and the asymmetry of message passing.

mov task3result,result ' optionally pass back result
mov task3req,#0 ' release task if PC not incremented past TPAUSE

- but I also do not see an opcode that neatly allows compact mixing of flags and params

David Betz · 2014-03-08 13:59

It seems like this TPAUSE/TRESUME feature is very close to what is needed to support traps which will be needed for handling TLB misses if we ever get to trying to execute code from external memory through pages cached in hub memory. A TLB miss could automatically pause the task that causes it and jump to some predefined location. It would also need to store a trap reason in another predefined location. The code at that location would then service the trap and possibly modify the state saved by the hardware triggered TPAUSE and then execute a TRESUME on itself to return to the code that was running prior to the trap. I think this could all be done in the context of a single task rather than requiring a scheduler task running in parallel with the task being scheduled. In fact, if you add a timer as a possible trap source then you can do a scheduler within a single task. And Bill's YIELD instruction could essentially be a software trap that is processed as a breakpoint. Single stepping could be done by adding one bit to the state saved by TPAUSE such that another TPAUSE is automatically triggered after a single instruction is executed. I suppose this is essentially introducing interrupts to the P2 but it seems a lot simpler than two tasks running in tandem to effect essentially the same thing.
