Propeller II update - BLOG

Seairth · 2014-02-26 14:28

Bill Henning wrote: »

All,

Threads will need a way of voluntarily suspending themselves until some event happens (e.g. SD data loaded, socket has data).

The simplest way I can think of is to set a "YIELD" bit in a cog location, and loop until that bit is clear.

The scheduler can pause the thread when it sees the YIELD bit, and when the waited-upon event happens, it can clear the YIELD bit and then resume the thread.

The beauty of this is that no new instructions or logic is required to implement this.

(Think "select" in *nix)

Once Chip adds the ability for one task to set another task's flag bits, you could use JZ/JNZ.

rogloh · 2014-02-26 14:34

Bill Henning wrote: »

The most common scenario I see is:

task 0 - scheduler
task 1 - controlled by the scheduler, runs threads

Agree this will likely be very common. Also having the 2 other tasks in the COG generally freed up will be useful for adding the following capabilities if desired:

1) debugger & I/O driver - eg debugger waiting on an I/O pin for control from a serial port to start/stop/step, inject breakpoints etc
2) profiler - wakes up periodically and captures the PC of the main running task for performance analysis. This could potentially be combined with the scheduler function too.

Roger

Bill Henning · 2014-02-26 14:41

Seairth wrote: »

Once Chip adds the ability for one task to set another task's flag bits, you could use JZ/JNZ.

Sorry, that won't work.

The scheduler task can't just arbitrarily set the flag on a running task, as it would clobber a potentially important state at a bad moment.

It would have to work something like this:

' this is the thread that wants to yield
        or      YIELD,#1
pause   tjnz    YIELD,#pause
' YIELD is just a cog register

Then the scheduler can check for yield

        or      YIELD,#0 wz             ' wz added so the if_nz tests below see an updated Z flag
if_nz   TPAUSE  thread_state,#1         ' assuming the thread is running in task 1
if_nz   mov     YIELD,#0                ' clear the flag so when we re-start the thread it will exit the tjnz loop
'
' re-schedule threads and do other work

and when it is time to wake up the thread that wanted to yield

         TRESUME thread_state,#1

Doing it this way will cause the tjnz loop to exit on the next iteration, as the YIELD register would have been cleared to zero by the scheduler.

Doing it this way does not need any special flags or registers to implement yielding. We should probably use a known register for YIELD, I nominate $1F1
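
The handshake described in this post can be modelled off-chip as a rough sketch. The Python below is only a simulation under stated assumptions: the `Cog` class and the function names are hypothetical, and TPAUSE/TRESUME are represented by a plain boolean because those instructions were still proposals at this point in the thread.

```python
# Simulation only: models the YIELD-register handshake, not real P2 hardware.
from dataclasses import dataclass, field


@dataclass
class Cog:
    yield_reg: int = 0            # stands in for the shared cog register (e.g. $1F1)
    thread_running: bool = True   # stands in for the TPAUSE/TRESUME state of task 1
    log: list = field(default_factory=list)


def thread_request_yield(cog):
    """Thread side: the 'or YIELD,#1' before the tjnz loop."""
    cog.yield_reg |= 1
    cog.log.append("thread: yield requested")


def thread_spin(cog):
    """Thread side: one pass of the tjnz loop. Returns True once the loop exits."""
    if not cog.thread_running:
        return False              # a paused task executes nothing
    return cog.yield_reg == 0     # tjnz falls through only when YIELD is zero


def scheduler_poll(cog):
    """Scheduler side: if a yield is pending, pause the thread and clear the flag."""
    if cog.yield_reg != 0:
        cog.thread_running = False    # hypothetical TPAUSE
        cog.yield_reg = 0             # so the loop exits after the later resume
        cog.log.append("scheduler: paused thread")
        return True
    return False


def scheduler_resume(cog):
    """Scheduler side: the awaited event happened, let the thread continue."""
    cog.thread_running = True         # hypothetical TRESUME
    cog.log.append("scheduler: resumed thread")


if __name__ == "__main__":
    cog = Cog()
    thread_request_yield(cog)
    scheduler_poll(cog)
    scheduler_resume(cog)
    print(thread_spin(cog), cog.log)
```

The point of the model is the ordering: the scheduler clears the flag while the thread is paused, so the thread never observes a half-updated state.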

Bill Henning · 2014-02-26 14:46

Hi Roger,

I agree, profiling probably belongs in the scheduler task.

A debugger is a good use for task #3, and #4 could always be used for I/O... however for best performance of the threads, I'll stick with just scheduler & threading task (except when debugging)

rogloh wrote: »

Agree this will likely be very common. Also having the 2 other tasks in the COG generally freed up will be useful for adding the following capabilities if desired:

1) debugger & I/O driver - eg debugger waiting on an I/O pin for control from a serial port to start/stop/step, inject breakpoints etc
2) profiler - wakes up periodically and captures the PC of the main running task for performance analysis. This could potentially be combined with the scheduler function too.

Roger

mindrobots · 2014-02-26 15:07

For us 'nano bound' folks, for the next release would it be possible to have one without video and more of the "goodies" that have been dropped? Or is the video component not a significant portion of the real estate?

I must be a strange prop user, I hardly ever use video.

Thanks!

Ken Gracey · 2014-02-26 16:02

mindrobots wrote: »

I must be a strange prop user, I hardly ever use video.

Thanks!

Of all of our high-volume Propeller users [which means 100-2K units a year], only a few of them use video capability. Some customers put a composite connector on the PCB for debugging, system tests and checks, but they also do not use it in the actual application.

You might be strange in other ways, but your Prop programming behavior is similar to many customers.

Ken Gracey

mindrobots · 2014-02-26 16:34

Ken Gracey wrote: »

You might be strange in other ways, but your Prop programming behavior is similar to many customers.

Ken Gracey

Hey, Ken!

Thanks for clearing that up!

Cluso99 · 2014-02-26 17:17

Ken,
Interesting results Ken. We don't have video on our commercial 3 prop boards.

Re the DE0, my testing efforts do not use video (or any of the DACs), so if this gets us some of the other parts back in its place, this would be an excellent build for me too. But we would need both because ozpropdev and baggers will both want video.

Cluso99 · 2014-02-26 17:29

A very interesting overnight (for me) thread re starting the cog with COGNEWX/COGRUNX. It makes sense to me to have the basic boot code in hub rom (I have been suggesting this) and I didn't see any need to use COGNEW/COGRUN as the hubexec program would load the cog and start it by jmp $0.

Second point is starting with instruction cache enabled. Agreed this should be the default.

As for the preemptive multitasking, I am happy with what we already have without it.

Chip,
May I ask for a release sooner rather than later? A lot of us are in limbo waiting for the new release.

It may in fact prove more beneficial for us to trial the new tasking first; feedback from that may help the preemptive design more than trying to shove something in in haste. Personally I would rather the USB instructions and SERDES took priority over preemptive multitasking.

May I ask for the one USB instruction (I can hand code a long so pnut doesn't require the mod yet) in the release pretty please??? Then I can get to the next stage of USB while I have time. CRC can come a little later.

Reason: Add new pin-pair instruction for use with USB bit-banging receive (similar to GETP/GETNP)
        The S value (sub-instruction bits) "yyyyyyyy" would use the next available slot after CACHEX
Thread: http://forums.parallax.com/showthread.php/151904-Here-is-the-update-from-the-Big-Change!!!?p=1222515&viewfull=1#post1222515
1111111 ZC L CCCC DDDDDDDDD xyyyyyyyy       GETXP   [#]D [WZ],[WC]  ' set flags for the pin-pair for usb bit-banging  
                                                                    '   D = PINx (0..127), PINy := PINx XOR $1 (it's complementary pin-pair)
                                                                    '   C = C XOR PINx via WC
                                                                    '   Z = !(PINx OR PINy) via WZ (i.e. Z is set if PINx and PINy are both zero == SE0 in USB)
PINx and PINy are a pair of pins. If PINx is even then PINy := PINx + 1 else if PINx is odd then PINy := PINx - 1
The allowance for the PINx/PINy pair to be reversed is for USB LS & FS where J/K are effectively swapped between D-/D+.
WZ & WC would normally be used.

Finally, don't we only have one set of PTRA/PTRB registers, not a set for each task? I thought they were on the critical path and could not be expanded to one set per task.

jmg · 2014-02-26 17:36

Cluso99 wrote: »

It may in fact prove more beneficial for us to be able to trial the new tasking and the feedback for preemptive may be more beneficial rather than trying to shove in something in haste. Personally I would rather the USB instructions and SERDES took priority over preemptive multitasking.

I think what was called preemptive multitasking, has simplified down along these lines :

In post #5314 Chip was talking about SW scheduler, and says "This doesn't take any more hardware than an instruction to get and set the task state."

ie the same opcode (pair?) that is very useful for Debug as well.

ozpropdev · 2014-02-26 17:38

Cluso99 wrote: »

Re the DE0, my testing efforts do not use video (or any of the DACs), so if this gets us some of the other parts back in its place, this would be an excellent build for me too. But we would need both because ozpropdev and baggers will both want video.

Ray
Most of my P2 work is on a DE2 now, so Nano losing video won't affect me.
At the end of the day I'm happy to work with whatever variants/releases are available.
IIRC Baggers got his DE2 going again so Nano video removal may not affect him either.
Cheers
Brian

ozpropdev · 2014-02-26 17:58

Cluso99 wrote: »

Finally, don't we only have one set of PTRA/PTRB registers, not a set for each task? I thought they were on the critical path and could not be expanded to one set per task.

Ray
See here
Brian

potatohead · 2014-02-26 18:44

I don't see that as strange at all.

P1 video capability is really cool. I've an interest in those things, which is why I started back in on this stuff with P1 chips. But, really using the chip video for basic design requirements is likely to use more resources than practical. P2 will not have the same issues overall, so perhaps on chip video will see greater adoption across more projects / products.

Bob Lawrence (VE1RLL) · 2014-02-26 18:53

re:I must be a strange prop user, I hardly ever use video.

I love playing with video. A prop with no video is like a TV with no picture.

I still enjoy running the balls video demo LOL. If we lose video on the Nano I'll have to find a way to upgrade my hardware or stick with my current software release (2014_02_06).

mindrobots · 2014-02-26 19:11

Bob Lawrence (VE1RLL) wrote: »

re:I must be a strange prop user, I hardly ever use video.

I love playing with video. A prop with no video is like a TV with no picture. I still enjoy running the balls video demo LOL. If we lose video on the Nano I'll have to find a way to upgrade my hardware or stick with my current software release (2014_02_06).

I'm not suggesting we lose video. I'm suggesting a second Nano build without video if it would make room for some of the lost features that no longer fit. This wouldn't need to happen until there is a candidate for silicon so more people can test more things.

I don't want anyone to lose their balls or no longer be able to play.

cgracey · 2014-02-27 12:45

I've been looking into what it takes to completely redirect a task, so that preemptive multitasking and single-stepping can be accomplished. It turns out that the following bits need to be saved and restored:

16 bits for PC
1 bit for Z flag
1 bit for C flag
18 bits for PTRA
18 bits for PTRB
1 bit for TLOCK pending
2 bits for delayed branch pending
16 bits for delayed branch address
23 bits for AUGS value
1 bit for AUGS pending
23 bits for AUGD value
1 bit for AUGD pending
46 bits for REPS/REPD

167 bits total = 5 longs, 7 bits

That's a lot of data needed to store a task state!

How about instead of being able to stop a task at any point in its program, we have a circuit that waits for an opportune situation before stopping the task. If we waited for the following, we would only need to track PC/Z/C and PTRA/PTRB:

TLOCK is not pending (this potentially causes a 1-instruction delay)
a delayed branch is not pending (this potentially causes a 3-instruction delay)
AUGS/AUGD is not pending (this potentially causes a 1..2 instruction delay)
REPS/REPD is not active (this potentially causes an unknown delay)

By avoiding those circumstances, we eliminate 113 bits of state information that needs saving and restoring, bringing the total down to 54 bits, of which JMPTASK can restore 18 (Z/C/PC) and operand-less instructions can copy the target task's PTRA/PTRB to and from the switcher task's PTRA/PTRB. This would take very little hardware. It would completely enable preemptive multitasking, but would increase the granularity of single-stepping in cases where TLOCK, AUGS/AUGD, or a delayed branch is pending, or where REPS/REPD is active. Single-stepping would step over those cases as if they were one instruction.
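
As a quick cross-check of the figures in this post, the bit counts can be summed mechanically. This is only a sketch of the arithmetic; the dictionary keys are labels copied from the list above and the variable names are illustrative.

```python
# Bit widths copied from the task-state list above.
state_bits = {
    "PC": 16,
    "Z": 1,
    "C": 1,
    "PTRA": 18,
    "PTRB": 18,
    "TLOCK pending": 1,
    "delayed branch pending": 2,
    "delayed branch address": 16,
    "AUGS value": 23,
    "AUGS pending": 1,
    "AUGD value": 23,
    "AUGD pending": 1,
    "REPS/REPD": 46,
}

# State that still has to be tracked if the stop waits for an opportune moment.
always_tracked = ("PC", "Z", "C", "PTRA", "PTRB")

total = sum(state_bits.values())
kept = sum(state_bits[name] for name in always_tracked)
eliminated = total - kept
longs, spare_bits = divmod(total, 32)                   # 32 bits per long
jmptask_bits = sum(state_bits[name] for name in ("PC", "Z", "C"))

print(total, longs, spare_bits)        # 167 5 7
print(eliminated, kept, jmptask_bits)  # 113 54 18
```

The sums agree with the post: 167 bits in total, 113 avoided by waiting, and 54 left, of which 18 are Z/C/PC and 36 are PTRA/PTRB.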

Do you think this is adequate, or should the full 167 bits be handled in order to provide more granular single-stepping, as well as REPS/REPD interruption?

Sapieha · 2014-02-27 12:58

Hi Chip.

In my opinion.

> Make it as simple as possible. The remaining things can be done in software if needed.

Bill Henning · 2014-02-27 13:02

I think that is more than adequate!

But I am confused... why is there a need to store PTRA/PTRB if they are copied to the scheduler's PTRA/B? The scheduler could save PTRA/B somewhere.

The scheduler could save it itself, thus saving 36 bits of state (unless I am missing something) bringing the state down to 18 bits again (PC,Z,C)

While debugging code, if there is ANY doubt about a REPx block, it can be turned into a regular DJNZ loop for debugging purposes.

Regarding delayed branches, that is simply a step size.. and for testing, they can be replaced with the non-delayed version.

Another state we forgot about... the 4 deep LIFO stack.

I think the easiest solution is to have the scheduling task map in PTRA/B and LIFO from the task it just halted with TPAUSE, and have it restore them just before TRESUME.

cgracey wrote: »

I've been looking into what it takes to completely redirect a task, so that preemptive multitasking and single-stepping can be accomplished. It turns out that the following bits need to be saved and restored:

16 bits for PC
1 bit for Z flag
1 bit for C flag
18 bits for PTRA
18 bits for PTRB
1 bit for TLOCK pending
2 bits for delayed branch pending
16 bits for delayed branch address
23 bits for AUGS value
1 bit for AUGS pending
23 bits for AUGD value
1 bit for AUGD pending
46 bits for REPS/REPD

167 bits total = 5 longs, 7 bits

That's a lot of data needed to store a task state!

How about instead of being able to stop a task at any point in its program, we have a circuit that waits for an opportune situation before stopping the task. If we waited for the following, we would only need to track PC/Z/C and PTRA/PTRB:

TLOCK is not pending (this potentially causes a 1-instruction delay)
a delayed branch is not pending (this potentially causes a 3-instruction delay)
AUGS/AUGD is not pending (this potentially causes a 1..2 instruction delay)
REPS/REPD is not active (this potentially causes an unknown delay)

By avoiding those circumstances, we eliminate 113 bits of state information that needs saving and restoring, bringing the total down to 54 bits, of which JMPTASK can restore 18 (Z/C/PC) and operand-less instructions can copy the target task's PTRA/PTRB to and from the switcher task's PTRA/PTRB. This would take very little hardware. It would completely enable preemptive multitasking, but would increase the granularity of single-stepping in cases where TLOCK, AUGS/AUGD, or a delayed branch is pending, or where REPS/REPD is active. Single-stepping would step over those cases as if they were one instruction.

Do you think this is adequate, or should the full 167 bits be handled in order to provide more granular single-stepping, as well as REPS/REPD interruption?

ctwardell · 2014-02-27 13:06

From a multitasking point of view the short version is probably best, less time to switch context and I think most threaded code would be fine without using REPS/REPD since that could skew timing.

From the single-step point of view, the long version is best because of the step overs in the short version.

I guess it's a matter of what is most important.

I guess that isn't very helpful...

C.W.

jmg · 2014-02-27 13:11

cgracey wrote: »

.. This would take very little hardware. It would completely enable preemptive multitasking, but would increase the granularity of single-stepping in cases where TLOCK, AUGS/AUGD, or a delayed branch is pending, or where REPS/REPD is active. Single-stepping would step over those cases as if they were one instruction.

Do you think this is adequate, or should the full 167 bits be handled in order to provide more granular single-stepping, as well as REPS/REPD interruption?

I think 'special cases' can be handled with simulation in the debugger. (or code re-write/conditionals in the short term)
If single step does step-over, that sounds like a no-surprises outcome. ie the code does not lose the plot.

If the Debug-step can capture time on either side of a Step, (should be a couple of lines of SW?) then it can report how long the step actually took.
With time-info, it is then more obvious that a 'longer step-over' occurred, and that info is also useful for debug anyway.

This time capture may run afoul of task-slice phase - is there enough control in the Task Map and Step, to always 'fire' the (mostly) single step with a consistent delay, so the dT captures are cycle accurate ? (fixed offsets are ok, variable ones less so)

dT would also be useful where the user wanted to step-over themselves, or time-flight between two break points.

cgracey · 2014-02-27 13:15

Bill Henning wrote: »

I think that is more than adequate!

But I am confused... why is there a need to store PTRA/PTRB if they are copied to the scheduler's PTRA/B? The scheduler could save PTRA/B somewhere.

The scheduler could save it itself, thus saving 36 bits of state (unless I am missing something) bringing the state down to 18 bits again (PC,Z,C)

While debugging code, if there is ANY doubt about a REPx block, it can be turned into a regular DJNZ loop for debugging purposes.

Regarding delayed branches, that is simply a step size.. and for testing, they can be replaced with the non-delayed version.

Another state we forgot about... the 4 deep LIFO stack.

I think the easiest solution is to have the scheduling task map in PTRA/B and LIFO from the task it just halted with TPAUSE, and have it restore them just before TRESUME.

Wow! Your idea about simply remapping the TPAUSED task's PTRA/PTRB into the scheduler's PTRA/PTRB, as well as the LIFO is really simple to implement and even less logic. I need to make an instruction which gets any task's {Z,C,PC} into D, while JMPTASK can re-establish those values later.

cgracey · 2014-02-27 13:21

jmg wrote: »

I think 'special cases' can be handled with simulation in the debugger. (or code re-write/conditionals in the short term)
If single step does step-over, that sounds like a no-surprises outcome. ie the code does not lose the plot.

If the Debug-step can capture time on either side of a Step, (should be a couple of lines of SW?) then it can report how long the step actually took.
With time-info, it is then more obvious that a 'longer step-over' occurred, and that info is also useful for debug anyway.

This time capture may run afoul of task-slice phase - is there enough control in the Task Map and Step, to always 'fire' the (mostly) single step with a consistent delay, so the dT captures are cycle accurate ? (fixed offsets are ok, variable ones less so)

dT would also be useful where the user wanted to step-over themselves, or time-flight between two break points.

Tracking time might be simple if the before-step and after-step code's timing was known. You would take an initial CNT reading, do the single-step, get a CNT delta using SUBCNT, and then subtract a constant to account for the before-step and after-step code. You could even make sure that the before- and after-code were properly registered to the hub cycle to maintain apparent hub fidelity within the single-stepped code.
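
The subtraction described above can be sketched as follows. This is an illustration only: it assumes a 32-bit free-running counter that wraps, and `overhead` stands for the known, constant cycle cost of the before-step and after-step code. The function name and the sample numbers are hypothetical.

```python
CNT_MASK = 0xFFFFFFFF  # assumption: CNT is treated as a 32-bit wrapping counter


def step_cycles(cnt_before, cnt_after, overhead):
    """Cycles spent in the single-stepped code itself.

    cnt_before / cnt_after: counter readings taken around the step.
    overhead: constant cycle cost of the debugger's own before/after code.
    """
    delta = (cnt_after - cnt_before) & CNT_MASK   # wrap-safe difference
    return delta - overhead


if __name__ == "__main__":
    # The counter wrapped between the two readings; the mask hides that.
    print(step_cycles(0xFFFFFFF0, 0x00000010, 24))
```

Because the overhead is a fixed offset, a longer-than-usual result directly indicates that the step covered more than one instruction.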

Bill Henning · 2014-02-27 13:31

Thanks!

I'd suggest:

TPAUSE savereg ' Z,C,PC saved into savereg, could even be fixed to $1F1

TRESUME savereg ' Z,C,PC loaded from savereg, could even be fixed to $1F1

No need for a new JMPTASK; if there is a need to re-start the thread at a different point than where it was saved, savereg could be modified before the thread is resumed.

The thread could also be started by setting the start address and initial C/Z values in savereg, and simply TRESUME'ing the thread.

HMMM

Perhaps better names would be:

TPAUSE savereg, #1..3 ' pause currently running thread, save PC, C, Z into savereg, map PTRA/B, LIFO into task
TRUN savereg, #1..3 ' start a new thread, or continue paused thread. Thread gets PC, C, Z from current task

Single stepping could be simply doing a TPAUSE, and then performing a 'TSTEP savereg,#1..3' to execute only a single instruction.

I wonder if it would be simpler for TSTEP to run all four cycles of the next instruction (and not pipeline following instructions) - this should solve the Delay issue...

FYI, only one instruction is needed... TPAUSE/TRUN/TSTEP could be distinguished by unused bits in the source bit area, or even WC/WZ

cgracey wrote: »

Wow! Your idea about simply remapping the TPAUSED task's PTRA/PTRB into the scheduler's PTRA/PTRB, as well as the LIFO is really simple to implement and even less logic. I need to make an instruction which gets any task's {Z,C,PC} into D, while JMPTASK can re-establish those values later.

cgracey · 2014-02-27 13:32

ctwardell wrote: »

From a multitasking point of view the short version is probably best, less time to switch context and I think most threaded code would be fine without using REPS/REPD since that could skew timing.

From the single-step point of view, the long version is best because of the step overs in the short version.

I guess it's a matter of what is most important.

I guess that isn't very helpful...

C.W.

It IS helpful. You pointed out that the short version would take less time for a context switch. That is a big deal which I didn't get, at first. If you had to handle 5+ longs every time you stopped a task, that would be a lot more to deal with than just waiting for a more opportune point in the next few instructions when there's going to be way less data to handle. I think for this reason, alone, it makes sense. It would be nice to be able to break into REPS/REPD, but it would be expensive in terms of data saving/restoring. I think REPS/REPD will just be used for short bursts, anyways.

Bill Henning · 2014-02-27 13:34

Addendum:

In order to keep the thread/task separation clear, I'd suggest

THHALT savereg,#1..3
THRUN savereg,#1..3
THSTEP savereg,#1..3

It's actually pretty amazing that a flexible multi-threading model *PLUS* debugging can be implemented with just three helper instructions and some code.

And the instructions could be easily encoded into a single dual-reg opcode!

jmg · 2014-02-27 14:01

cgracey wrote: »

Tracking time might be simple if the before-step and after-step code's timing was known. You would take an initial CNT reading, do the single-step, get a CNT delta using SUBCNT, and then subtract a constant to account for the before-step and after-step code. You could even make sure that the before- and after-code were properly registered to the hub cycle to maintain apparent hub fidelity within the single-stepped code.

I figured time-capture itself would be simple.
I was less clear if the opcode Bill is calling
THSTEP savereg,#1..3

would be cycle predictable.
If you have a 100% slot usage for the Debug-kernel, and then release one slot for Step, if that is done via the task-map, how does the Debug know which slot it is currently in, and exactly when the Step will occur ?

Or does THSTEP avoid using the normal task mapper, and fire a one-shot for the task being debugged ?

Bill Henning · 2014-02-27 14:21

I thought about it some more, and there is no need for the 'H' - these instructions affect the task, the thread is purely a software construct built with their capabilities!

TSTOP savereg,#1..3 ' only called by the scheduler task
TRUN savereg,#1..3 ' only called by the scheduler task
TSTEP savereg,#1..3 ' only called by the scheduler task
TWAIT #n ' new instruction! explanation below - NOT TO BE CALLED BY SCHEDULER

There are two other usage cases that should be addressed:

1) A task/thread executing a breakpoint

2) A thread voluntarily yielding as it is waiting for some event (time, signal, socket, etc)

As a breakpoint can be considered as the thread waiting for the debugger, I think one instruction can handle all of the above.

In all of these cases, the thread has to get the attention of the scheduler. We can do this without adding any logic!

TWAIT #n ' write N to $1F1, and wait forever (TSTOP will stop the task, and TRUN will resume at the next address, right after the TWAIT)

We have two easy to use locations in a cog - that are not normally loaded.

$1F1 - TWAIT value
$1F0 - savereg

So basically, the scheduler will in its scheduling loop do the equivalent of:

TJNZ $1F1, #thread_waiting

and code can then decode the reason the thread is waiting, which can be one of:

- breakpoint (say 0..255)

- waiting for a signal/event/timeout (indicated by 256..511)

Note the signal values are totally arbitrary.

TWAIT completes the set - allows for threads to yield, to wait for elapsed time, and also gives us breakpoints!
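
How the scheduler might decode the TWAIT value can be sketched in a few lines. This is a hedged illustration, not part of the proposal: `decode_twait` is a hypothetical helper, the event number being the offset from 256 is an assumption, and 0 is treated as "nothing waiting" because a TJNZ test cannot tell a stored 0 apart from an empty register (so breakpoint 0 would not be detectable under this scheme).

```python
def decode_twait(value):
    """Classify the value a thread left in the TWAIT register ($1F1 in the post)."""
    if value == 0:
        return ("idle", None)            # TJNZ would not branch: nothing is waiting
    if 1 <= value <= 255:
        return ("breakpoint", value)     # breakpoint number
    if 256 <= value <= 511:
        return ("event", value - 256)    # assumption: event number is the offset from 256
    raise ValueError(f"TWAIT value out of range: {value}")


if __name__ == "__main__":
    for v in (0, 7, 256, 300):
        print(v, decode_twait(v))
```

As the post notes, the ranges are arbitrary; only the split between "debugger wants this" and "scheduler wants this" matters.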

cgracey · 2014-02-27 14:35

jmg wrote: »

If you have a 100% slot usage for the Debug-kernel, and then release one slot for Step, if that is done via the task-map, how does the Debug know which slot it is currently in, and exactly when the Step will occur ?

SETTASK writes a pattern to the task register which rotates right with each instruction, wrapping at the point just below where the leading %00 is. Always, the two LSBs dictate the task to execute (unless TLOCK is in effect, in which case the task register stays still and the prior task repeats until TFREE). So, to do a single-step, you would do something like 'SETTASK #%%10' followed by 'SETTASK #%%1' (assuming task 0 was the target task and task 1 was the scheduler task).
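
One way to picture the rotating task register is the small model below. It is a sketch of one reading of the description above, not a statement of how the hardware works: it assumes the rotating window is every 2-bit field up to the highest non-zero field, it ignores TLOCK and pipeline timing, and `task_sequence` is a hypothetical name.

```python
def task_sequence(pattern, count):
    """Return the first `count` task ids selected for a base-4 task `pattern`.

    The pattern is split into 2-bit fields, least significant first. The
    lowest field picks the task for the current instruction, then the
    window rotates right by one field.
    """
    fields = []
    remaining = pattern
    while remaining:
        fields.append(remaining & 0b11)   # lowest 2 bits = next task id
        remaining >>= 2
    if not fields:
        fields = [0]                      # all-zero pattern: only task 0 runs
    return [fields[i % len(fields)] for i in range(count)]


if __name__ == "__main__":
    print(task_sequence(0b0100, 4))   # %%10: target task 0 and scheduler task 1 alternate
    print(task_sequence(0b01, 3))     # %%1: only the scheduler task 1 runs
```

Under this reading, %%10 hands task 0 a slot before task 1 runs again, and switching back to %%1 takes the slot away, which is what makes the single-step possible.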

jmg · 2014-02-27 14:38

Bill Henning wrote: »

...
TWAIT completes the set - allows for threads to yield, to wait for elapsed time, and also gives us breakpoints!

All sounds good to me.

Bill Henning wrote: »

So basically, the scheduler will in its scheduling loop do the equivalent of:

TJNZ $1F1, #thread_waiting

I don't think that wait for Break, is 1 cycle granular ?
it would be nice if the time between breaks could be cycle accurate.

Bill Henning · 2014-02-27 14:42

Thanks.

I don't think that the TWAIT grain matters, as it will take many cycles for the scheduler to notice it, and handle it.

I am just happy it would give us a YIELD equivalent, essentially unlimited breakpoints, AND a mechanism to implement select() easily!

jmg wrote: »

All sounds good to me.

I don't think that wait for Break, is 1 cycle granular ?
it would be nice if the time between breaks could be cycle accurate.
