Propeller II

cgracey · 2012-08-11 01:58

Sapieha wrote: »

Hi Chip.

I have been thinking about what you said, that the D-long field is hard-wired in MOVF.
You now have this for the field mover:
SETF D/# - set up field mover
%w_xxdd_yyss

Why not make it 32 bits wide, so that it can even hold the D-long address, like this:
%E_MSSSSSSSSS_MDDDDDDDDD_rr_w_xxdd_yyss

E = Use S, D pointers
M = 0 - Main COG memory address, 1 - LUT memory area address
rr = Spare bits, maybe for 10=increment, 11=decrement of the D-long.

E = Use S, D pointers -- 0 = use only the w_xxdd_yyss field control
1 = use the entire extended capabilities

That even leaves 2 bits free, maybe for 10=increment, 11=decrement options on the D-long.

With that we would have a complete BYTE handler for the entire COG/LUT memory area.

That would be great, but the problem is that by the time the pipeline is ready to execute the MOVF instruction, D and S are already set, and it's too late to go back and index cog RAM or the CLUT. This would have to be worked into the pipeline and that is too sticky of a thing to attempt at this point.
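As a minimal sketch of the 32-bit layout proposed in the quote above (this is not an existing SETF encoding), the fields could be unpacked as follows. The Python names are illustrative, and the bit positions simply follow the left-to-right order of the pattern `%E_MSSSSSSSSS_MDDDDDDDDD_rr_w_xxdd_yyss`, which is an assumption. The low 9 bits are kept as one opaque value because the thread does not spell out their meaning.

```python
from typing import NamedTuple

class SetfFields(NamedTuple):
    """Fields of the *proposed* 32-bit SETF value (hypothetical layout)."""
    use_pointers: bool  # E: 1 = use the S/D pointer extension
    s_in_lut: bool      # M for S: 0 = cog RAM, 1 = LUT
    s_addr: int         # 9-bit S-long address
    d_in_lut: bool      # M for D: 0 = cog RAM, 1 = LUT
    d_addr: int         # 9-bit D-long address
    rr: int             # 2 spare bits (10 = increment, 11 = decrement suggested)
    control: int        # existing 9-bit w_xxdd_yyss field, left undecoded

def decode_setf(value: int) -> SetfFields:
    """Split a 32-bit word into the proposed fields, MSB first."""
    value &= 0xFFFFFFFF
    return SetfFields(
        use_pointers=bool((value >> 31) & 1),
        s_in_lut=bool((value >> 30) & 1),
        s_addr=(value >> 21) & 0x1FF,
        d_in_lut=bool((value >> 20) & 1),
        d_addr=(value >> 11) & 0x1FF,
        rr=(value >> 9) & 0b11,
        control=value & 0x1FF,
    )
```

With E = 0 only the low 9 bits matter, so a value such as `%1_0000_0000` decodes to the same control field as today's shorter form.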

cgracey · 2012-08-11 02:03

evanh wrote: »

Apologies, I'd not tried to understand that code example. I'm not familiar with Cog assembly having not actually written any. I assumed it was a way of loading the next LMM context but that's not the case is it, it's more like a sequence of gotos with only the smallest of context and can all fit in a single Cog, right?

Makes my request look a little bloated.

EDIT: Oh, the whole set of LMM contexts is contained in the Cog, right? ... I think I need to learn a little more and shut up now ...

Every time a TASKSW is executed, the program counter and flags are saved in the current context register (C0..C7 in the example) and the next program counter and flag set are loaded from the next context register. Basically, you run a bit of your code and then execute a TASKSW to switch over to the next program. Eventually, you'll begin executing again right after the TASKSW, after everyone else has had a turn.
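The round-robin behaviour described above can be modelled in a few lines. This is only a sketch: the `Cog` class and its field names are made up for illustration, and it assumes the saved program counter is the address of the instruction following the TASKSW.

```python
class Cog:
    """Toy model of cooperative switching between context registers."""

    def __init__(self, entry_points):
        # One context register per task, holding (pc, z, c).
        self.contexts = [(pc, False, False) for pc in entry_points]
        self.current = 0
        self.pc, self.z, self.c = self.contexts[0]

    def tasksw(self):
        """Model a TASKSW located at address self.pc."""
        # Save the resume address (the instruction after TASKSW) and flags.
        self.contexts[self.current] = (self.pc + 1, self.z, self.c)
        # Load the next task's program counter and flags.
        self.current = (self.current + 1) % len(self.contexts)
        self.pc, self.z, self.c = self.contexts[self.current]
```

After every task has executed its own TASKSW once, the first task resumes at the address right after its TASKSW with its flags intact.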

I'm not sure what your last question is about (LMM).

Sapieha · 2012-08-11 02:09

Hi Chip.

MOVF yes
But by the time SETF executes, is it too late in the pipeline as well?

cgracey wrote: »

That would be great, but the problem is that by the time the pipeline is ready to execute the MOVF instruction, D and S are already set, and it's too late to go back and index cog RAM or the CLUT. This would have to be worked into the pipeline and that is too sticky of a thing to attempt at this point.

evanh · 2012-08-11 02:11

Thanks.

cgracey wrote: »

I'm not sure what your last question is about (LMM).

Primarily it's in reference to the preallocated section of the general register set that each task gets. Typically, it won't be offloaded when switching - saving the time to reload that a generalised task switcher would normally deal with. Keeping that thrashing to a minimum.

Heater. · 2012-08-11 02:12

Chip,
I cannot see the complete code on my phone here but that tasksw looks really sweet.
Now that you have a context switching mechanism is there a simple way to get task switch to happen automatically on every instruction? So two tasks would be able to run at half normal rate each. No overhead of having to read and execute a tasksw instruction. To keep it simple there would be no priority mechanism.
In fact it would be nice for the task switch to happen after every instruction time even if the instruction has not finished. Then multiple tasks could be waiting on different events, pin or time or vid.

Sapieha · 2012-08-11 02:15

Hi Chip.

For a JUMP I think you need to flush the pipeline. Why not flush it for that instruction too?

evanh · 2012-08-11 02:19

Heater. wrote: »

In fact it would be nice for the task switch to happen after every instruction time even if the instruction has not finished. Then multiple tasks could be waiting on different events, pin or time or vid.

And still only one set of general registers ... hmm, sounds less expensive than my idea but, like Chip's example, it relies on sharing the Cog space. Not bad.

I guess these are really all academic ideas as they are significant work to implement.

Heater. · 2012-08-11 02:33

Hah yes, it's easy for us spectators to have all kind of "simple" suggestions without realizing they may require a major redesign to implement.

cgracey · 2012-08-11 02:35

Heater. wrote: »

Chip,
I cannot see the complete code on my phone here but that tasksw looks really sweet.
Now that you have a context switching mechanism is there a simple way to get task switch to happen automatically on every instruction? So two tasks would be able to run at half normal rate each. No overhead of having to read and execute a tasksw instruction. To keep it simple there would be no priority mechanism.
In fact it would be nice for the task switch to happen after every instruction time even if the instruction has not finished. Then multiple tasks could be waiting on different events, pin or time or vid.

I wish I had thought about this earlier, because it might have been somewhat trivial to have an array of 8 program counters and z/c flags that could be switched among. Man, that's pretty compelling! Ask yourself this: if instructions floated through the pipeline that each represented a different pc/z/c, would it matter, as long as appropriate pc/z/c's were updated at the end of each instruction? Would the registers care? I don't think so, but it would take a little consideration to know for sure.

As TASKSW works right now, it's actually a JMPRET instruction, so it takes 4 clocks (1 to execute plus 3 to reload the pipeline).

cgracey · 2012-08-11 02:38

Sapieha wrote: »

Hi Chip.

For a JUMP I think you need to flush the pipeline. Why not flush it for that instruction too?

Because things would not be in proper order and you'd have to do some patchwork to get things back on track after the instruction. It would be too disruptive, I think.

cgracey · 2012-08-11 02:41

evanh wrote: »

Thanks.

Primarily it's in reference to the preallocated section of the general register set that each task gets. Typically, it won't be offloaded when switching - saving the time to reload that a generalised task switcher would normally deal with. Keeping that thrashing to a minimum.

There aren't any pre-allocated registers for anything, really. It's all how you want to code it up.

evanh · 2012-08-11 02:51

cgracey wrote: »

There aren't any pre-allocated registers for anything, really. It's all how you want to code it up.

Yeah, pre-allocated by LMM multitasking implementation only.

Heater. · 2012-08-11 02:52

Not my idea of course. It comes from XMOS. They have a four-stage pipeline and each stage can be working on an instruction from a different thread. The result is that up to four threads run at the same rate as one. After that they can go up to eight threads, but the execution rate of each thread starts to slow accordingly. But importantly, timing determinism is maintained; not that they like you to rely on instruction counting for timing, but rather to use other mechanisms on the chip like clocked I/O.

No idea how your pipelines look, and thinking about such things gives me a headache, but even if this automatic task switching mode only worked for a max of two threads it would be great. Then it would be like having an up-to-16-COG Propeller at half speed.

jmg · 2012-08-11 02:53

cgracey wrote: »

I wish I had thought about this earlier, because it might have been somewhat trivial to have an array of 8 program counters and z/c flags that could be switched among. Man, that's pretty compelling! Ask yourself this: if instructions floated through the pipeline that each represented a different pc/z/c, would it matter, as long as appropriate pc/z/c's were updated at the end of each instruction? Would the registers care? I don't think so, but it would take a little consideration to know for sure.

Where this could get gnarly is around jumps, which flush the pipeline?
The Prop should allow a time-slice relatively easily, as it has no interrupts and no stack.

When doing this, make sure the user can allocate the 8 slices in a simple setup table. (24 bits?)
Default as all slots mapped to one, gives 100% to one thread, and it works 'normally'

We looked at a threaded design from China that hard-sliced three ways (it actually included three memory blocks, so it had a reasonable silicon overhead, but it means standard tools see a standard resource set), and sadly we could not fit (time-wise) code that would have worked great with a 2/3 : 1/3 allocation.
It just plain failed on their hardwired, no-choice 1/3:1/3:1/3 split. The silicon cost of offering a choice would have been tiny; I think they simply forgot to do this.

evanh · 2012-08-11 03:02

jmg wrote: »

Where this could get gnarly is around jumps, which flush the pipeline?

I think it would be done with separate pipes along with the separate PC/Flags. Each one would have independent fills/refills.

cgracey · 2012-08-11 03:03

Heater. wrote: »

Not my idea of course. It comes from XMOS. They have a four-stage pipeline and each stage can be working on an instruction from a different thread. The result is that up to four threads run at the same rate as one. After that they can go up to eight threads, but the execution rate of each thread starts to slow accordingly. But importantly, timing determinism is maintained; not that they like you to rely on instruction counting for timing, but rather to use other mechanisms on the chip like clocked I/O.

No idea how your pipelines look, and thinking about such things gives me a headache, but even if this automatic task switching mode only worked for a max of two threads it would be great. Then it would be like having an up-to-16-COG Propeller at half speed.

One nice thing about having at least as many tasks as your pipeline is deep: you don't need any data forwarding, because different tasks must respect each other's register space, so they're not affecting the same registers. Wait... this wouldn't be true for I/O registers - you'd really want data forwarding there (i.e. XOR PINA,#1). You could get around this by giving each task its own set of I/O regs which all get OR'd together, like happens among cogs now.
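A minimal sketch of the "each task gets its own I/O register, all OR'd together" idea follows. The function and variable names are hypothetical; only the OR-combining itself comes from the post.

```python
def combined_outputs(task_out_regs):
    """OR together each task's private 32-bit output register."""
    combined = 0
    for reg in task_out_regs:
        combined |= reg
    return combined & 0xFFFFFFFF

# Four tasks, each with its own (hypothetical) private output register.
outs = [0, 0, 0, 0]
outs[0] ^= 1        # task 0 does the equivalent of XOR PINA,#1
outs[1] ^= 1 << 4   # task 1 toggles a different pin
pins = combined_outputs(outs)
```

Because each task only reads and modifies its own register, a read-modify-write such as the XOR never depends on a value another task still has in flight. The trade-off is that a pin stays high as long as any task drives it high.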

Heater. · 2012-08-11 03:08

Hmm..sounds like you have got it sorted:)

cgracey · 2012-08-11 03:10

jmg wrote: »

Where this could get gnarly is around jumps, which flush the pipeline?

Bingo! That's probably the only issue. That could be solved by cancelling only instructions in the pipe that belong to THAT task. That's not too complicated. I've even got some spare op-code space to set it up in. Don't tell anybody.

cgracey · 2012-08-11 03:12

evanh wrote: »

I think it would be done with separate pipes along with the separate PC/Flags. Each one would have independent fills/refills.

No, it actually could use the same pipe, just float two bits through the pipeline stages that identify which-of-four tasks it is.
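To illustrate the shared-pipe idea, here is a toy model in which every in-flight instruction carries a task tag, and a jump cancels only the entries with the jumping task's tag. The class name, the depth of 4, and the string "instructions" are all assumptions made for the sketch; it says nothing about the real pipeline stages.

```python
class TaggedPipeline:
    """Toy shared pipeline whose entries are (task_id, instruction) pairs."""

    def __init__(self, depth=4):
        # Index 0 is the newest stage, the last index is about to retire.
        self.stages = [None] * depth

    def issue(self, task, instr):
        """Shift in a new instruction and return the one that retires."""
        retired = self.stages.pop()
        self.stages.insert(0, (task, instr))
        return retired

    def cancel_task(self, task):
        """On a jump, turn only this task's in-flight entries into bubbles."""
        self.stages = [
            None if slot is not None and slot[0] == task else slot
            for slot in self.stages
        ]
```

With tasks 0 and 1 interleaved, a jump in task 0 leaves task 1's instructions untouched, so task 1 keeps retiring on schedule.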

Heater. · 2012-08-11 03:20

Chip,
"Don't tell anybody"
Err...have we just added half a year to P2 development time?!

evanh · 2012-08-11 03:21

jmg wrote: »

When doing this, make sure the user can allocate the 8 slices in a simple setup table. (24 bits?)
Default as all slots mapped to one, gives 100% to one thread, and it works 'normally'

I would go four threads with a 16 slot 32 bit table initially containing #0 value giving thread zero 100%. This allows a nice extended slicing map.
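A quick sketch of how such a table could be read: 16 slots of 2 bits each fill exactly 32 bits, and each slot names one of four threads. Putting slot 0 in the lowest two bits is an assumption, as are the function names.

```python
def slot_tasks(table):
    """Return the thread number (0..3) assigned to each of the 16 slots."""
    return [(table >> (2 * slot)) & 0b11 for slot in range(16)]

def task_shares(table):
    """Fraction of the time slots that each of the four threads receives."""
    tasks = slot_tasks(table)
    return [tasks.count(t) / 16 for t in range(4)]
```

A table of 0 gives thread zero every slot, while `0x44444444` alternates threads 0 and 1 for a 50/50 split. Uneven splits such as 12 slots to 4 slots fall out of the same table.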

evanh · 2012-08-11 03:21

Heater. wrote: »

Chip,
"Don't tell anybody"
Err...have we just added half a year to P2 development time?!

Lol!

cgracey · 2012-08-11 03:29

evanh wrote: »

I would go four threads with a 16 slot 32 bit table initially containing #0 value giving thread zero 100%. This allows a nice extended slicing map.

That would be very clean!

To add something like this may not take more than one day, and it would add maybe several hours to the synthesis work, at this point, at $175/hr.

For this to work, you would have to avoid using instructions like WAITxxx or REPS that either stall or mess with the pipeline. A stall would just be ugly, with respect to other tasks, but instructions that toy with the pipeline would wreak havoc. 'Just some stuff you'd need to take into consideration when programming multiple tasks. And you'd have to avoid resource conflicts, like who's using INDA/INDB/PTRA/PTRB. Memory accesses would cause brief stalls. The cache wouldn't mind, though.

Dave Hein · 2012-08-11 03:30

Now just drive the task swap from a timer or an external signal on a pin and you got an interrupt.

cgracey · 2012-08-11 03:34

Dave Hein wrote: »

Now just drive the task swap from a timer or an external signal on a pin and you got an interrupt.

Too much! Head going to explode!

Heater. · 2012-08-11 03:37

Not being able to have different threads waiting on different things would be a shame. But I guess one of the main points of the wait instructions is to get into a low power consumption state. Given that, while one thread waits, other threads continue to run, perhaps that does not apply.
Sounds like this threading thing is worth going for.

Heater. · 2012-08-11 03:39

Dave,
"... you have got an interrupt"

Nooooo....

evanh · 2012-08-11 03:48

Dave Hein wrote: »

Now just drive the task swap from a timer or an external signal on a pin and you got an interrupt.

Hehe, no, screws the determinism.

But WAITs are important. Need the WAITs. Must have WAITs!

cgracey · 2012-08-11 03:57

evanh wrote: »

Hehe, no, screws the determinism.

But WAITs are important. Need the WAITs. Must have WAITs!

These are all the waits there are: WAITVID, WAITCNT, WAITPEQ, WAITPNE. Are they so important? You could poll for all but WAITVID.
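As a rough sketch of what polling would test in place of the blocking instructions, the conditions can be written as plain predicates that a task checks each time it gets a turn. The function names are hypothetical, the pin tests assume the usual "(pins AND mask) compared with state" meaning of WAITPEQ/WAITPNE, and the counter test uses a wrap-safe "reached or passed" comparison, since a polling task may not sample the counter on the exact matching clock.

```python
def peq(pins, state, mask):
    """Condition WAITPEQ waits for: masked pins equal state."""
    return (pins & mask) == state

def pne(pins, state, mask):
    """Condition WAITPNE waits for: masked pins differ from state."""
    return (pins & mask) != state

def cnt_reached(cnt, target):
    """True once a 32-bit free-running counter has reached or passed target."""
    return ((cnt - target) & 0xFFFFFFFF) < 0x80000000
```

A polling task would evaluate one of these and simply carry on, or yield, when it returns False.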

Heater. · 2012-08-11 04:00

Hmm.. That's the thing, no interrupts but event driven. If a thread can wait on a pin or whatever, it effectively becomes an interrupt handler. Except that when the event fires and the thread continues, it has no effect on the execution of other threads. After all, there is no context to save, it has its own, and it does not steal execution time. Determinism is maintained.
