Observations of a multi-tasker

David Betz · 2013-09-19 09:56

User Name wrote: »

I think this discussion is excellent, ozpropdev! A slightly complicated time-critical application is just what a new architecture needs. Coding techniques are developed and potential architectural improvements are discovered.

I agree. Now if only we could find a way to get you a DE2-115 board. Think what you could do with that!

pedward · 2013-09-19 11:01

Let me fix that for you.

cgracey · 2013-09-19 12:13

Since we are going to respin the synthesized block, I'm making some minor changes to the Verilog code.

Is there a compelling reason to allow a variable number of time slots, instead of just 16? For example, we could allow 15 time slots, which would be good for 3 equal tasks. Any opinions?

My new Win8 machine is starting to let me on the internet today, but not very reliably. It's got its priorities, you know.

Heater. · 2013-09-19 12:30

Chip,

I'm kind of worried about making unnecessary changes just now.
Actually I don't think I understand what you are asking re: the time slots or how it would help with ozpropdev's issues.

No idea what's up with your Windows 8. I recently had to install Windows 7 on a friend's laptop from the Fujitsu install CD. It surprised me by not doing any weird stuff. I did feel dirty afterwards though.

Bill Henning · 2013-09-19 12:45

At the risk of having Ken faint... I agree with Heater, I also think it would be better to avoid any changes (other than fixes) until after there is a working Prop2.

cgracey wrote: »

Since we are going to respin the synthesized block, I'm making some minor changes to the Verilog code.

Is there a compelling reason to allow a variable number of time slots, instead of just 16? For example, we could allow 15 time slots, which would be good for 3 equal tasks. Any opinions?

My new Win8 machine is starting to let me on the internet today, but not very reliably. It's got its priorities, you know.

cgracey · 2013-09-19 12:55

This Win8 machine has eaten my explanation post.

mindrobots · 2013-09-19 12:57

Chip,

I'm not sure I fully understand.

My thoughts/understanding is that once you inject a second (or third or fourth) task into the timeslot mix, you've stalled the pipeline.

My question is how different, performance wise, would a 3 task 15 slot timeslice register of: 012012012012012 be from a 16 slot register of: 0120120120120120 - giving one more kick to task 0?

The HUB timing is another issue. I'm not sure about it at all.

cgracey · 2013-09-19 13:04

mindrobots wrote: »

Chip,

I'm not sure I fully understand.

My thoughts/understanding is that once you inject a second (or third or fourth) task into the timeslot mix, you've stalled the pipeline.

My question is how different, performance wise, would a 3 task 15 slot timeslice register of: 012012012012012 be from a 16 slot register of: 0120120120120120 - giving one more kick to task 0?

The HUB timing is another issue. I'm not sure about it at all.

The trouble is that the dangling time slot that needs to be assigned somewhere causes pipeline order indeterminacy, which prevents you from using the delayed branches, since you won't know how many instructions to put after JMPD, for example.
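
To make the dangling-slot point concrete, here is a minimal Python sketch (not Propeller code). It takes the two slot patterns from mindrobots' question, treats each as endlessly repeating, and measures how many clocks separate consecutive slots of the same task. The function name and the string encoding of the patterns are made up for illustration.

```python
def same_task_gaps(pattern):
    """For each task in a repeating slot pattern, return the set of distances
    (in slots) from one of its slots to its next slot, including the wrap."""
    n = len(pattern)
    gaps = {}
    for task in sorted(set(pattern)):
        positions = [i for i, t in enumerate(pattern) if t == task]
        dists = set()
        for k, pos in enumerate(positions):
            nxt = positions[(k + 1) % len(positions)]
            # A distance of 0 means the task owns a single slot: it recurs after n.
            dists.add((nxt - pos) % n or n)
        gaps[task] = dists
    return gaps


gaps15 = same_task_gaps("012012012012012")   # 15 slots, 3 tasks
gaps16 = same_task_gaps("0120120120120120")  # 16 slots, extra slot to task 0

print(gaps15)  # every task: always 3 slots apart
print(gaps16)  # task 0: 3 or 1 apart; tasks 1 and 2: 3 or 4 apart
```

With 15 slots every task runs exactly every third clock. With 16 slots the spacing changes at the wrap point, so the mix of tasks sitting in the pipeline is not the same every time, which appears to be the indeterminacy described above.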

Yanomani · 2013-09-19 13:08

cgracey wrote: »

Since we are going to respin the synthesized block, I'm making some minor changes to the Verilog code.

Is there a compelling reason to allow a variable number of time slots, instead of just 16? For example, we could allow 15 time slots, which would be good for 3 equal tasks. Any opinions?

My new Win8 machine is starting to let me on the internet today, but not very reliably. It's got its priorities, you know.

Chip, just for clarification: if one assigns 15 time slots to one COG's thread scheduler, to ease a threefold division, and perhaps 16 to some other COG, targeting an ideal fourfold one, and their routines are to be synced at some meaningful interval, will this mean that a coincidence will occur only every 16 x 15 = 240 clocks?
Moreover, since the HUB's cycling is always fixed and does not allow shortcuts, such as skipping a block of selectively disabled COGs, if one intends to pass data between routines in two different COGs, won't this option just make timing awareness even harder?

I believe Pandora's box never closed at all; like my country's government attitude, its face keeps being uglier and everyday there are even bigger snakes spanning from its head.

Yanomani

mindrobots · 2013-09-19 13:14

So if you have 2 or 4 tasks, you get a 16 slot timeslicer and if you have 3 tasks, you get a 15 slot timeslicer? Then everything can be deterministic?

cgracey · 2013-09-19 13:16

What I was trying to post above:

I have a list of simple enhancements and a bug fix, most of which I've implemented already:

1) GETPIX can handle 5:5:5 pixels so that you can fit two into D. This effectively doubles the pixel bandwidth for SDRAM, since now you can get two rendered pixels into a long, instead of a single 8:8:8 pixel.
2) Added a few more constant modes for QSINCOS. We had 1.0 and 7/8. I added 255/256 and 3/4 modes.
3) I changed CALLD/CALLAD/CALLBD so that rather than have the return address be next PC + 3, it now adds up the same-task instructions in the pipeline, which will be from 0 to 3. By making it smart like this, you can take advantage of delayed calls in multitasking code. This is where, say, 15 timeslots become important for 3 tasks, so that you have pipeline order determinacy.
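
One possible reading of item 3, as a minimal Python sketch rather than the actual Verilog: the return address is the next PC plus the number of instructions in the pipeline that belong to the calling task. The function name, the argument names, and the list representation of the pipeline are assumptions made for illustration.

```python
def calld_return_address(next_pc, pipeline_tasks, calling_task):
    """Hypothetical model of the smart delayed-call return address.

    pipeline_tasks: task ids of the (up to 3) instructions in the pipeline
    alongside the delayed call. Only those owned by the calling task count
    as its delay-slot instructions.
    """
    same_task = sum(1 for t in pipeline_tasks if t == calling_task)
    return next_pc + same_task


# Single task: all three pipeline slots are its own, so this matches the old PC + 3.
single = calld_return_address(100, [0, 0, 0], calling_task=0)

# Three even tasks (0,1,2,0,1,2,...): only one of the three slots is task 0's.
three_even = calld_return_address(100, [1, 2, 0], calling_task=0)

# Four even tasks: none of the three slots belongs to the caller.
four_even = calld_return_address(100, [1, 2, 3], calling_task=0)
```

Under this reading the count is only predictable when the slot order is, which is why the even 0,1,2 rotation matters for three tasks.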

There are two bugs I'm fixing:

1) Adding async clears to the flops that output DIR and OUT directly to the I/O pins. This is a known problem with the current silicon.
2) Making REPS/REPD abort if its current task is being affected by a JMPTASK instruction. This is a known bug in the Verilog.

Two more things I'm thinking of adding:

1) Variable number of timeslots (16 or 15), which will make 3 even tasks possible, and also allow them to exploit delayed branches for greater efficiency.
2) Unique REPS/REPD circuitry for each task. All tasks will be able to do their own REPS/REPD, instead of just one task.

cgracey · 2013-09-19 13:24

These changes are all very minor and not in the critical paths, so everyone will know. They are easily tested, too.

About the even 3-task scheduling: It could be done by writing all 1's to the timeslot register (unlikely to be otherwise purposeful). When that happens, it goes into a 0,1,2,0,1,2,0,1,2,... mode, since three even tasks is a very special case. 16 timeslots are fine, on the other hand, for evenly dividing 1, 2, or 4 tasks. Three tasks is a special case, though.
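
A minimal Python sketch of the special case described above. It assumes the timeslot register is 32 bits wide and holds sixteen 2-bit task fields, lowest field first; that layout, the function name, and the slot counter are assumptions for illustration, not taken from the Verilog.

```python
ALL_ONES = 0xFFFFFFFF  # would otherwise mean "task 3 in every slot" under the assumed layout


def task_for_slot(timeslot_reg, slot_counter):
    """Return the task that owns the given time slot.

    Assumed layout: sixteen 2-bit fields, field 0 in the lowest bits.
    All 1's selects the special even three-task rotation 0,1,2,0,1,2,...
    """
    if timeslot_reg == ALL_ONES:
        return slot_counter % 3
    field = slot_counter % 16
    return (timeslot_reg >> (2 * field)) & 0b11


three_task = [task_for_slot(ALL_ONES, i) for i in range(6)]
four_task = [task_for_slot(0xE4E4E4E4, i) for i in range(6)]  # fields 0,1,2,3 repeating
```

Because all 1's would otherwise just hand every slot to one task, reusing it as the trigger costs nothing useful.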

ctwardell · 2013-09-19 13:27

Chip,

My input is that it could be useful and if it is a very low risk change I would go for it.

Chris Wardell

Sapieha · 2013-09-19 13:28

Hi Chip.

I had meant to stay out of this thread, but since you asked that question, I need to mention one of my ideas: making the HUB slots variable in length.

That is, some COGs would have no access to the HUB at all and would communicate with the others only through the internal port.

cgracey wrote: »

Since we are going to respin the synthesized block, I'm making some minor changes to the Verilog code.

Is there a compelling reason to allow a variable number of time slots, instead of just 16? For example, we could allow 15 time slots, which would be good for 3 equal tasks. Any opinions?

My new Win8 machine is starting to let me on the internet today, but not very reliably. It's got its priorities, you know.

Heater. · 2013-09-19 13:40

I just don't see how tweaking the thread scheduling slots helps with the issue of HUB ops piling up at an instant and delaying a thread by 24 clocks (or whatever it is) whilst it waits for the threads in front of it to get their HUB ops.

Or is this about something else altogether?

cgracey · 2013-09-19 14:00

Heater. wrote: »

I just don't see how tweaking the thread scheduling slots helps with the issue of HUB ops piling up at an instant and delaying a thread by 24 clocks (or whatever it is) whilst it waits for the threads in front of it to get their HUB ops.

Or is this about something else altogether?

It's only about keeping the pipeline ordered so that you know how many instructions to put after a delayed branch. That's all. It won't affect the timing resulting from hub ops.

Baggers · 2013-09-19 14:09

Hi Chip,
Awesome news that you added the optional GETPIX spewing out 5:5:5 mode for D

I like the 15 timeslots idea for cogs with three threads running too!
Some really nice mods going in.

I have a question about the timing of the hub-ops on each thread

Say you have four threads, for this purpose, all were in the middle of copying data from Hub to cogram

thread1-4 were all in tight reps doing one rdlong instruction

How would they be processed? and for this example, the first hub-op would be ok for thread1 to hit, i.e. not stall.

would it be

    Thread1  Thread2  Thread3  Thread4  threadprocessed
0   rdlong   -        -        -        1
1   cycle2   -        -        -        1
2   cycle3   -        -        -        1
3   -        stall    -        -        2
4   -        -        stall    -        3
5   -        -        -        stall    4
6   stall    -        -        -        1
7   -        stall    -        -        2
0   -        -        rdlong   -        3
1   -        -        cycle2   -        3
2   -        -        cycle3   -        3
3   -        -        -        stall    4
4   stall    -        -        -        1
5   -        stall    -        -        2
6   -        -        stall    -        3
7   -        -        -        stall    4

and repeating so that just thread 1 and 3 would complete their hub-ops without threads 2 and 4 getting any action until thread1 and thread3 finished their run?

Sorry for the bad spacing

cgracey · 2013-09-19 14:34

Baggers wrote: »

...I have a question about the timing of the hub-ops on each thread

Say you have four threads, for this purpose, all were in the middle of copying data from Hub to cogram

thread1-4 were all in tight reps doing one rdlong instruction

How would they be processed?

It would look like this:

All four tasks executing RDLONGs:

Hub	Task0	Task1	Task2	Task3
-------------------------------------
3	STALL	-	-	-
4	STALL	-	-	-
5	STALL	-	-	-
6	STALL	-	-	-
7	STALL	-	-	-
0	RDLONG	-	-	-
1	STALL	-	-	-
2	STALL	-	-	-
3	-	STALL	-	-
4	-	STALL	-	-
5	-	STALL	-	-
6	-	STALL	-	-
7	-	STALL	-	-
0	-	RDLONG	-	-
1	-	STALL	-	-
2	-	STALL	-	-
3	-	-	STALL	-
4	-	-	STALL	-
5	-	-	STALL	-
6	-	-	STALL	-
7	-	-	STALL	-
0	-	-	RDLONG	-
1	-	-	STALL	-
2	-	-	STALL	-
3	-	-	-	STALL
4	-	-	-	STALL
5	-	-	-	STALL
6	-	-	-	STALL
7	-	-	-	STALL
0	-	-	-	RDLONG
1	-	-	-	STALL
2	-	-	-	STALL
<repeat>

Baggers · 2013-09-19 14:44

Thanks for clearing that up Chip

I wasn't sure if the stall would continue with the next thread or not.
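
For reference, Chip's table above can be regenerated with a minimal Python sketch. It models only what the table shows: a task's RDLONG stalls until hub slot 0, performs the read, then stalls two more clocks before the next task's RDLONG starts. The constant and function names are made up, and the starting hub slot of 3 is simply copied from the table.

```python
HUB_PERIOD = 8   # hub slots 0..7, as in the table
RDLONG_TAIL = 2  # stalled clocks shown after the access in the table


def rdlong_trace(num_tasks=4, start_slot=3, window=0):
    """Return (hub_slot, task, state) rows for one pass over all tasks."""
    rows = []
    slot = start_slot
    for task in range(num_tasks):
        # Stall until this cog's hub window comes around.
        while slot != window:
            rows.append((slot, task, "STALL"))
            slot = (slot + 1) % HUB_PERIOD
        rows.append((slot, task, "RDLONG"))
        slot = (slot + 1) % HUB_PERIOD
        # The instruction still occupies the pipeline for a couple of clocks.
        for _ in range(RDLONG_TAIL):
            rows.append((slot, task, "STALL"))
            slot = (slot + 1) % HUB_PERIOD
    return rows


trace = rdlong_trace()
for hub_slot, task, state in trace:
    print(hub_slot, "Task%d" % task, state)
```

In this model each task gets a full hub rotation to itself, so with four tasks all doing RDLONGs, each one completes a read every 32 clocks.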

Circuitsoft · 2013-09-19 15:00

Could hub-ops have an asynchronous mode where one instruction makes a request and a later instruction retrieves the result, while the operation happens in the background? Also, if the request op could be made to take "0 instructions", i.e. happen in parallel with the instruction that follows it, then it shouldn't slow down any existing algorithms.

Yanomani · 2013-09-19 15:02

Sapieha wrote: »

Hi Chip.

I had meant to stay out of this thread, but since you asked that question, I need to mention one of my ideas: making the HUB slots variable in length.

That is, some COGs would have no access to the HUB at all and would communicate with the others only through the internal port.

Hi Sapieha

I'll second you on that!

It'll take the whole virtual peripheral concept even further. Ever since the inception of a CLUT memory area for each COG to use for Rx and Tx buffers, I have been dreaming about a shadow-COG-driven virtual peripheral.
I'll only humbly add a suggestion: create a way to test in software for the arrival of new values written from one COG to another one's internal port, and the corresponding reverse test that lets the writing COG check whether the previous data was read by the receiving one, using the wz or wc side effect.
Better yet if we can split the port into two 16-bit single duplex channels, with corresponding shadow interlock flags, but I believe that this is too much for a humble pledge!

Yanomani

Phil Pilgrim (PhiPi) · 2013-09-19 15:02

Re: 15 timeslots.

When the number of thread timeslots and the number of cycles between hub-access slots are relatively prime, it will be much harder to determine the sweet spot for hub accesses. That's because one "epoch" between equally-spaced hub accesses for a given thread becomes much longer.
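
A minimal Python sketch of the arithmetic behind this. The "epoch" is taken here to be the least common multiple of the two cycle lengths. The 8-clock hub rotation is an assumption read off the hub slots 0..7 in the tables earlier in the thread, and the function name is made up.

```python
from math import gcd


def epoch_length(timeslots, hub_period=8):
    """Clocks until the task-slot pattern and the hub rotation realign (their LCM)."""
    return timeslots * hub_period // gcd(timeslots, hub_period)


epoch_16 = epoch_length(16)      # 16 slots vs 8-clock hub rotation
epoch_15 = epoch_length(15)      # 15 slots vs 8-clock hub rotation
two_cogs = epoch_length(15, 16)  # a 15-slot cog against a 16-slot cog

print(epoch_16, epoch_15, two_cogs)
```

With 16 slots the pattern lines up with the hub every 16 clocks. With 15 it takes 120, and a 15-slot cog against a 16-slot cog realigns only every 240 clocks, the figure Yanomani raised earlier.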

-Phil

Yanomani · 2013-09-19 15:19

Hi Chip

Before I attract Ken's full wrath with my big mouth and frenetic fingers, I have one more suggestion to make.

If not yet done, please put some pulldown or pullup resistors on the four unconnected pins from P92 to P95. They'll become nice internally reachable semaphores for all the COGs to play with.

Yanomani

Heater. · 2013-09-19 15:25

Sapieha, Yanomani,

...ideas to make variable length of HUB slots.

This idea has been floated since the beginning of time. On the face of it, it's great. I mean, in the extreme 7 COGs might never access HUB. Either not running at all or only dealing with I/O on pins. That means the one COG doing HUB access could always have access immediately. No HUB round robin to wait for. A great boost to its performance.

But wait...

That means I could post code to the OBEX that relies on that behavior. It might only work if it gets more than its "one in eight" shot at hub.

That means that someone dropping my code into their project might find it does not work for them because their project does not offer enough free HUB slots.

All of a sudden we don't have independence, in timing, between objects. When I mix and match objects from here and there I cannot be sure they will work unless I carefully check the timing dependencies. It's like being back in the bad old interrupt-driven days when I would have to check that my application and all its interrupt handlers have time to execute as required. Nightmare.

All in all, I don't think the trade off is worth it. It destroys the predictability and modularity of Propeller code by messing with the timing independence of modules.

A COG should be a COG and always behave the same no matter what other COGs are doing. If not we have destroyed a fundamental feature of the Propeller.

Ariba · 2013-09-19 15:32

cgracey wrote: »

What I was trying to post above:

I have a list of simple enhancements and a bug fix, most of which I've implemented already:

1) GETPIX can handle 5:5:5 pixels so that you can fit two into D. This effectively doubles the pixel bandwidth for SDRAM, since now you can get two rendered pixels into a long, instead of a single 8:8:8 pixel.
2) Added a few more constant modes for QSINCOS. We had 1.0 and 7/8. I added 255/256 and 3/4 modes.
3) I changed CALLD/CALLAD/CALLBD so that rather than have the return address be next PC + 3, it now adds up the same-task instructions in the pipeline, which will be from 0 to 3. By making it smart like this, you can take advantage of delayed calls in multitasking code. This is where, say, 15 timeslots become important for 3 tasks, so that you have pipeline order determinacy.

There are two bugs I'm fixing:

1) Adding async clears to the flops that output DIR and OUT directly to the I/O pins. This is a known problem with the current silicon.
2) Making REPS/REPD abort if its current task is being affected by a JMPTASK instruction. This is a known bug in the Verilog.

Two more things I'm thinking of adding:

1) Variable number of timeslots (16 or 15), which will make 3 even tasks possible, and also allow them to exploit delayed branches for greater efficiency.
2) Unique REPS/REPD circuitry for each task. All tasks will be able to do their own REPS/REPD, instead of just one task.

Hello Chip

1) If you make changes to the Verilog, please add a way to read out ACCA/ACCB with fixed scaling. Without that you lose so many instructions in DSP algorithms like IIR filters that the MACx instructions are nearly useless for such algorithms.

I think one of the most promising niches for the Prop 2 is its use in DSP applications. There are only very few DSPs around with the MMACS (million MACs per second) power of the Prop2, and they are either 16-bit DSPs or very expensive. And none of the competitors can do such fast MACs 8 times on the same chip.

The ideal added instruction would do:
- shift the 64bit ACCx arithmetic right (SAR) by 16..20 bits.
- saturate the result to the maximal positive and negative value for 16..20 bits.
- write the result to the destination register

SARACCx dst,#bits

The saturation can also be done with MAXS and MINS, but that needs 2 instructions more.
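
A minimal Python sketch of what the proposed SARACCx might compute, as a model of the arithmetic only. The post gives a single `#bits` operand; here the shift amount and the saturation width are separate parameters because the post leaves their relationship open. The function and parameter names are made up.

```python
def saracc(acc, shift, width):
    """Hypothetical model: arithmetic-shift a signed accumulator value right,
    then clamp the result to a signed `width`-bit range.

    acc is a Python int standing in for the signed 64-bit ACCx value.
    """
    shifted = acc >> shift           # Python's >> on ints preserves the sign
    hi = (1 << (width - 1)) - 1      # largest positive value for `width` bits
    lo = -(1 << (width - 1))         # most negative value for `width` bits
    return max(lo, min(hi, shifted))


in_range = saracc(3 << 18, shift=18, width=18)     # fits: result is 3
too_big = saracc(1 << 40, shift=18, width=18)      # clamps to +131071
too_small = saracc(-(1 << 40), shift=18, width=18) # clamps to -131072
```

The `max`/`min` pair plays the role of the two extra MAXS and MINS instructions mentioned above.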

If this is too complicated, then just an instruction that does something like FITACCx, but with a
fixed scaling by 18 (like SCL), would also help a lot. Such an instruction needs no src and dst fields, and
should therefore fit easily into the existing instruction encoding:

SCLACCA
SCLACCB

2) You once said that you would rather have made two separate registers, INx and OUTx, instead of PINx.
Are you considering making this change in the next Verilog version?
If not, then another way to read out the output latches would help a lot.

Andy

Yanomani · 2013-09-19 16:06

Heater. wrote: »

Sapieha, Yanomani,

This idea has been floated since the beginning of time. On the face of it, it's great. I mean, in the extreme 7 COGs might never access HUB. Either not running at all or only dealing with I/O on pins. That means the one COG doing HUB access could always have access immediately. No HUB round robin to wait for. A great boost to its performance.

But wait...

That means I could post code to the OBEX that relies on that behavior. It might only work if it gets more than its "one in eight" shot at hub.

That means that someone dropping my code into their project might find it does not work for them because their project does not offer enough free HUB slots.

All of a sudden we don't have independence, in timing, between objects. When I mix and match objects from here and there I cannot be sure they will work unless I carefully check the timing dependencies. It's like being back in the bad old interrupt-driven days when I would have to check that my application and all its interrupt handlers have time to execute as required. Nightmare.

All in all, I don't think the trade off is worth it. It destroys the predictability and modularity of Propeller code by messing with the timing independence of modules.

A COG should be a COG and always behave the same no matter what other COGs are doing. If not we have destroyed a fundamental feature of the Propeller.

Heater

Although I tend to agree with your observation in theory, from a pure software perspective it's not impossible to create some directive to express this need at assembly time.
Something like:

Heap request = n, where 0 <= n <= max.

For the present implementation, max equals eight.
For each COG's routine, this is to be done at the very beginning of the code block.
At assembly time, the sum of all heap requests at a given time must not be allowed to exceed the available max setting, which depends on the present Propeller implementation.
If max would be exceeded, an error is reported.
If, at a later time, Chip brings us a 16-COG Propeller version, or a dual HUB switcher, perhaps with different rotational speeds and COG slot skipping settings, then new scheduling options will be available.
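
The assembly-time check described above could look roughly like this minimal Python sketch. No such directive exists in any Propeller tool; the function name, the dictionary of requests, and the routine names are all hypothetical.

```python
def check_hub_requests(requests, max_slots=8):
    """Validate per-routine hub slot requests and return the slots left over.

    requests maps a routine name to its requested slot count n,
    where 0 <= n <= max_slots. Raises ValueError if the total is too high.
    """
    for name, n in requests.items():
        if not 0 <= n <= max_slots:
            raise ValueError("%s: request %d is outside 0..%d" % (name, n, max_slots))
    total = sum(requests.values())
    if total > max_slots:
        raise ValueError("requested %d hub slots, only %d available" % (total, max_slots))
    return max_slots - total


# Hypothetical routines: a video driver wanting half the hub, plus two others.
spare = check_hub_requests({"video": 4, "serial": 1, "main": 2})
print(spare)
```

A check like this catches over-subscription when the objects are combined, but it does not remove the coupling Heater describes: an object that asks for four slots still cannot be dropped into a project that has only three free.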

Yanomani

cgracey · 2013-09-19 17:08

Ariba wrote: »

Hello Chip

1) If you make changes to the Verilog, please add a way to read out ACCA/ACCB with fixed scaling. Without that you lose so many instructions in DSP algorithms like IIR filters that the MACx instructions are nearly useless for such algorithms.

I think one of the most promising niches for the Prop 2 is its use in DSP applications. There are only very few DSPs around with the MMACS (million MACs per second) power of the Prop2, and they are either 16-bit DSPs or very expensive. And none of the competitors can do such fast MACs 8 times on the same chip.

The ideal added instruction would do:
- shift the 64bit ACCx arithmetic right (SAR) by 16..20 bits.
- saturate the result to the maximal positive and negative value for 16..20 bits.
- write the result to the destination register

SARACCx dst,#bits

The saturation can also be done with MAXS and MINS, but that needs 2 instructions more.

If this is too complicated, then just an instruction that does something like FITACCx, but with a
fixed scaling by 18 (like SCL), would also help a lot. Such an instruction needs no src and dst fields, and
should therefore fit easily into the existing instruction encoding:

SCLACCA
SCLACCB

2) You once said that you would rather have made two separate registers, INx and OUTx, instead of PINx.
Are you considering making this change in the next Verilog version?
If not, then another way to read out the output latches would help a lot.

Andy

Andy,

Yes!!! This is something that I remember you bringing up a while ago and I know it will make DSP a lot better.

We have 20x20-bit signed multipliers for the MAC instructions, as you know, so by taking bits [top..18] of the result, we preserve significant digits, right? So, 18 is the single magic number of shifts, though 20..16 would be a nice range, correct? I will make some changes to accommodate this.

I thought about going to separate IN/OUT registers, but it's working in Spin2 so nicely now with your XOR advice, that it seems unnecessary to mess with. PINS, alone, seems fine for assembly programming. What do you think?

Ken Gracey · 2013-09-19 17:16

Ariba wrote: »

Hello Chip
1) If you make changes to the Verilog please add a way to read out the ACCA/ACCB with fixed scaling. . .
Andy

Nothing against Andy or the suggestion as it may be very beneficial.

Ears perked up a bit on this kind of request. I realize there may be tremendous benefits to some changes.

But they have to be considered in relation to the opportunity cost of not completing the project. Eight years is a long time, and Parallax has much to consider when changes are made. Parallax is one place where extended R&D has no consequences other than those that might matter the most: serious financial considerations, which can grind us to a stop if we are unable to derive revenue from our investments.

Just a word of caution, that's all. . . I'll look into swapping out the Starbucks coffee for some Yuban next week.

Cluso99 · 2013-09-19 17:19

Chip:
Is there some simple way that you could allow serial input to work?
Something like putting a gate to input to the VGA registers and be able to read the VGA registers, or
some way of applying a counter clock to the high speed interprop comms?

We could take care of start and stop bit detection etc with software, but some slight additional hardware using
the existing serial silicon would be fantastic.

Cluso99 · 2013-09-19 17:26

Yanomani: Unfortunately the variable hub cycle access has been beaten to death.

I agree that it would be nice, and respectfully disagree with heater because those using the variable hub access would take on all the issues involved. I think there would be a number of apps that could make efficient use of additional hub cycles and it could be done in such a way to allow some cogs the normal deterministic access.
