Should the next Propeller be code-compatible?

Phil Pilgrim (PhiPi) · 2008-09-02 07:07

Heater,

You're right: My comments at the end of #4 are accurate only in the context of the current discussion, where round-robin scheduling was the only type under consideration. In other types of cooperative multitasking, for example, a task can determine not only when to yield its time but to whom it should go. The JMPRET-style coroutines in the Prop I are an example of this. Not even preemptive multitasking and round-robin scheduling are strictly mutually-exclusive when separate tasks can have equal priority and vie for time, since this, too, could be done on a round-robin basis. But I think, for the sake of the current discussion, that round-robin scheduling is a given and that the choice is between cooperative and interleaved, single-cycle (== single-instruction on the Prop II) multitasking.
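As a rough illustration of the cooperative, round-robin case described above, here is a minimal sketch that models it with Python generators. It is not Propeller code: the names `task` and `run_round_robin` are made up for this example, and each `yield` stands in for the point where a Prop I task would hand over control with a JMPRET.

```python
from collections import deque

def task(name, steps, log):
    """A cooperative task: do one unit of work, then give up the processor."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # voluntary yield point (stand-in for a JMPRET hand-off)

def run_round_robin(tasks):
    """Run tasks in a fixed rotation until every one has finished."""
    ready = deque(tasks)
    while ready:
        current = ready.popleft()
        try:
            next(current)          # let the task run until it yields
            ready.append(current)  # back of the queue: strict round-robin
        except StopIteration:
            pass                   # task finished, drop it from the rotation

log = []
run_round_robin([task("A", 2, log), task("B", 3, log)])
print(log)
```

In this model a task only decides when to yield; the scheduler decides who runs next. Letting a task also pick its successor would mean yielding a target instead of a bare `yield`.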

-Phil

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
'Still some PropSTICK Kit bare PCBs left!

ImageCraft · 2008-09-02 07:21

Chip, I think it will take a little time to digest all the implications. One thing I am trying to wrap my head around is how all of these would interact with a C environment. I suspect that the LMM C would ignore all these and only the asm driver-like code would care, but I could be wrong, and maybe there is a nice way of providing Cog-threads even for LMM C.

// richard

heater · 2008-09-02 07:22

Amidst all this increasingly complex discussion, perhaps it's time to stand back and ask: What are we actually trying to achieve here?

Here is the top level problem as I see it:

Let's say I have my Prop loaded up with various objects, maybe only one is mine, the rest from obex or wherever. I have no more COGS left but I really would like that, say, mouse driver object. Now I'm stuck.

I can't do COGNEW for my mouse because there is no COG, and waiting for a 16 COG Prop is out.

But perhaps one of my COGS is not so heavily used so what I really want to do is something like THREADNEW (cogid, mouseobj, ....) which loads that nice ready made and simple to understand mouse driver into a spare thread on a little utilized COG. Without the need to hack around and combine two otherwise unrelated objects. Assuming of course the two objects I'm trying to run together don't use the same hardware resources.

THREADNEW seems to require position independent code, so I guess that's a total non-starter at the assembly level. But really that's what I want to do: run some code somewhere, anywhere, thread, COG, I don't care. BUT I don't want to have to mash unrelated objects together into a spaghetti soup!!

Well this leads me to a thought: We've accepted that the two objects to be combined are low power users, that is they could probably be LMM rather than native ASM, that is the threading should be at the LMM level. That is, it's not worth throwing a ton of hardware at this problem.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

Sapieha · 2008-09-02 07:29

Hi Chip Gracey.

You said:
"But there is no speed benefit to making hardware perform this task"
Forgive my confusion; I did not understand your answer at first. But I was not clear in my explanation.

That instruction is meant so that the program that runs it shall not stop running.
That instruction must have a flag (Z, or maybe C and Z) to tell me whether I shall continue in the program block with the start instruction, or else jump to the location given on LOAD. Else DATA transmission ends, NEXT buffer.

P.S. And 1 instruction is better than 6-8 instructions. With only 512 places - special registers

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.

Sapieha

Post Edited (Sapieha) : 9/2/2008 7:44:25 AM GMT

cgracey · 2008-09-02 07:42

Sapieha said...
Hi Chip Gracey.

If I start one COG as a multitasking system, I am not interested in speed. It is in the first place about more functionality for a COG with tasks that are not time critical.

In that way I have 7 COGs for critical tasks.

I understand what you're saying. So, special DMA instructions, while not faster than discrete instructions to perform the same task, would be nice just because they would save memory and be simpler to set up.

Here is what discrete code might look like:

dmaloop         RDLONG  cog_start,hub_start
                ADD     dmaloop,h200        'increment cog_start in prior instruction
                ADD     hub_start,#4        'increment hub_start by a long
                DJNZ    dmasize,#dmaloop

Something like this would be cleaner:

                SETPTRA hub_start
                RDLONGS cog_start,(#)length
                <dma done>

Maybe we could add RDLONGS and WRLONGS instructions to take care of this.
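To make the self-modifying step in the discrete loop easier to follow, here is a small Python model of it. It is only a sketch: `dma_loop`, the list-based `hub` and `cog` memories, and the sample values are assumptions for illustration. The one hardware fact it relies on is that a Prop I instruction keeps its destination address in bits 17..9, so adding $200 to the instruction advances the destination by one cog register.

```python
D_FIELD_SHIFT = 9           # destination field sits in bits 17..9 of the instruction
H200 = 1 << D_FIELD_SHIFT   # $200: adding it bumps the destination address by one

def dma_loop(hub, cog, hub_start, cog_start, dmasize):
    """Copy dmasize longs from hub (byte-addressed) into cog (long-addressed).

    Like DJNZ, the loop body always runs at least once, so dmasize must be >= 1.
    """
    d_field = cog_start << D_FIELD_SHIFT       # destination field of the RDLONG
    while True:
        dest = (d_field >> D_FIELD_SHIFT) & 0x1FF
        cog[dest] = hub[hub_start // 4]        # RDLONG cog_start,hub_start
        d_field += H200                        # ADD   dmaloop,h200
        hub_start += 4                         # ADD   hub_start,#4
        dmasize -= 1                           # DJNZ  dmasize,#dmaloop
        if dmasize == 0:
            break
    return cog

hub = list(range(100, 116))   # 16 longs of sample hub data
cog = [0] * 512               # 512 cog registers
dma_loop(hub, cog, hub_start=8, cog_start=0x10, dmasize=4)
print(cog[0x10:0x14])
```

A block instruction such as the RDLONGS suggested above would fold the three bookkeeping steps into the hardware, which is where the saving in memory and setup comes from.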

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

Sapieha · 2008-09-02 07:47

Hi Chip Gracey.

Read my post before yours.
We must both have written at the same time. My question ..... did you read my mind?

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.

Sapieha

Post Edited (Sapieha) : 9/2/2008 7:59:27 AM GMT

cgracey · 2008-09-02 07:54

Sapieha said...
Hi Chip Gracey.

Read my post before yours.
We must both have written at the same time. My question ..... did you read my mind?

Okay, so you want the DMA to not interrupt the cog execution, but be able to signal when done, or maybe have a special WAITDMA to ensure completion before proceeding. Since we are going to have DMA (hopefully) for the video generator, it may be simple to add this, too. Hub would have priority, then video.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

heater · 2008-09-02 07:55

Wow Chip: Didn't you somewhere suggest that it would be possible to read 32 bytes per HUB access interval, during a discussion about cachelines or such?

Well, couldn't RDLONGS with a length operand as you show be smart enough to do those 32 byte reads as required, thus making it substantially faster than the "dmaloop" in your example?

Perhaps also for BYTES and WORDS, and perhaps also for writes, no?

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

Sapieha · 2008-09-02 08:05

Hi Chip Gracey.

In both bit-banged and SERIN/OUT block mode protocols it is very useful to transfer DATA to or from HUB.

With task switching it does not stop the running process.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.

Sapieha

Phil Pilgrim (PhiPi) · 2008-09-02 08:09

Heater,

You make a compelling point: You can't throw a task at a problem with the same ease that you can throw a cog at it.

Allocation of scarce resources is never easy. (It can be fun, though!) In this case the scarce resource is silicon. One premise driving the Prop II design is its four-port RAM, which permits 160 MIPS execution speeds with a 160 MHz clock, but limits cog count due to the size of each memory cell. To maximize utilization of the available speed in the face of limited cogs, multitasking at the cog level has been proposed. The devil is in the details, though, with the discussion bouncing back and forth between hardware complexity (single-cycle, interleaved multitasking) and software complexity (cooperative multitasking).

One thing that has not been seriously broached, though, is whether down-shifting to a two-port RAM would free up enough silicon to lay out 16 cogs on a reasonably sized die. The speed of each would necessarily reduce to 80 MIPS, since the pipeline could not be so tightly interleaved. The downside, of course, is that apps which require the faster speed would somehow have to be split up and synchronized between two cogs, which adds another kind of software complexity. We've seen this already with some of the video work that's been done on the Prop I. I have a feeling that things have progressed too far down the road with the four-port design to consider a two-port alternative. But only Chip can answer that one.

My comments aren't in any way meant to favor one approach over the other, only to summarize where things now stand and the tradeoffs involved. I hope it's an accurate reflection and is helpful to those just joining (or rejoining) the discussion.

-Phil

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
'Still some PropSTICK Kit bare PCBs left!

cgracey · 2008-09-02 08:11

heater said...
Wow Chip: Didn't you somewhere suggest that it would be possible to read 32 bytes per HUB access interval, during a discussion about cachelines or such?

Well, couldn't RDLONGS with a length operand as you show be smart enough to do those 32 byte reads as required, thus making it substantially faster than the "dmaloop" in your example?

Perhaps also for BYTES and WORDS, and perhaps also for writes, no?

Yes, I was planning on wide reads and writes, but it was going to take tons of muxing. In light of the hardware task switching we were contemplating earlier, this would be trivial. So, maybe it can still be done. The 8-long accesses would necessarily be 8-long aligned, though (i.e. address %xxx...xxx00000).
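A quick way to see what the alignment rule and the wide access would mean in practice is the sketch below. The helper names `is_wide_aligned` and `hub_accesses` are invented for this example, and it assumes 4-byte longs, so an 8-long access covers 32 bytes and its address must have the low five bits clear.

```python
LONG_BYTES = 4
WIDE_LONGS = 8
WIDE_BYTES = LONG_BYTES * WIDE_LONGS   # 32 bytes per wide access

def is_wide_aligned(addr):
    """True if addr has its low five bits clear (%xxx...xxx00000)."""
    return addr & (WIDE_BYTES - 1) == 0

def hub_accesses(n_longs, longs_per_access):
    """Number of hub accesses needed to move n_longs (ceiling division)."""
    return -(-n_longs // longs_per_access)

print(is_wide_aligned(0x40), is_wide_aligned(0x44))
print(hub_accesses(64, 1), hub_accesses(64, WIDE_LONGS))
```

Under these assumptions a 64-long block needs 64 hub accesses one long at a time but only 8 with wide reads, provided the block starts on a 32-byte boundary.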

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

Javalin · 2008-09-02 08:14

Chip,

Keep it simple and powerful - like Prop1. Allow the programmer to create his/her own multitasking.

The only problem with the prop1 was that everybody created an object that ran on one cog - so by the time you have a gps, file system, some i2c, a bit of spi, a few serials etc, you are out of cogs. If there is an easy way (aka JMPRET) to allow people to program custom objects that (for example) receive gps, write to a file system etc then we'd be happy. This also allows programmers to get 100% of the time used within a cog.

The other issue was trying to do several time-intervalled tasks in a single cog - i.e. trying to use waitcnt/waitpeq etc - but as you say there are ways around that.

Interesting (and very fast moving!) discussion though!

James

Post Edited (Javalin) : 9/2/2008 8:19:56 AM GMT

Sapieha · 2008-09-02 08:17

Hi heater.

It takes 4-8 instructions to do the work. That wastes registers and processing time.
And in the RDLONGS instruction there is no room for a new HUB RAM address and length.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.

Sapieha

cgracey · 2008-09-02 08:20

Another thing about those 8-long accesses: There wouldn't be enough clocks to transfer 8 longs from hub to cog and still utilize every hub opportunity. Maybe 4 longs at a whack is better. It would take 2 cycles to input and store the first long, then 3 more to store the other three, leaving 3 cycles for ongoing execution.

Actually, we could do 8 longs at a time, but we would have to suspend execution, as every time slot, including the one in which the RDLONG actually initiates, would be used to write a location in cog memory.
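The cycle counting for the 4-long case can be written out as a small sketch. It rests on two assumptions that the post implies but does not state outright: that there are 8 clocks between hub opportunities, and that the first long costs 2 clocks while each further long costs 1. The names `HUB_WINDOW`, `slots_used` and `slots_left` are made up for the example, and the overlapped 8-long variant with suspended execution is not modelled.

```python
HUB_WINDOW = 8   # assumed clocks between one hub opportunity and the next

def slots_used(n_longs):
    """Clocks spent storing n_longs: 2 for the first long, 1 for each further long."""
    return 2 + (n_longs - 1)

def slots_left(n_longs):
    """Clocks left over for ongoing execution (negative means it does not fit)."""
    return HUB_WINDOW - slots_used(n_longs)

for n in (4, 8):
    print(n, slots_used(n), slots_left(n))
```

With these figures a 4-long transfer uses 5 clocks and leaves 3, while an 8-long transfer would need 9 clocks and so cannot fit in the window without suspending execution.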

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

Sapieha · 2008-09-02 08:24

Hi Chip Gracey.

In my opinion the 4 longs alternative is better.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.

Sapieha

cgracey · 2008-09-02 08:37

Sapieha said...
Hi Chip Gracey.

In my opinion the 4 longs alternative is better.

I've been thinking the same thing, because it would allow some other time slots in which the video could grab data. Otherwise, the video might get blocked for too long.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

Sapieha · 2008-09-02 08:46

Hi Chip Gracey.

Have you read my post on the enhancement to the Video generator/counter?
I am only curious.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.

Sapieha

QuattroRS4 · 2008-09-02 08:49

Chip Gracey said...
Yes, I was planning on wide reads and writes, but it was going to take tons of muxing. In light of the hardware task switching we were contemplating earlier, this would be trivial. So, maybe it can still be done. The 8-long accesses would necessarily be 8-long aligned, though (ie address %xxx...xxx00000).

But is that not a trade off .. sacrificing speed?

I know this is all an avenue of exploration, a 'sounding board' of sorts .. I am just wondering if all the points and suggestions outlined can be summarised, outlining the proposed inclusions/exclusions and the respective why/why not.

Regards,
John

EDIT: wow ... the thread moved on about 4 posts while I was trying to digest what had just been posted.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
'Necessity is the mother of invention'

Those who can, do.Those who can’t, teach.

Post Edited (QuattroRS4) : 9/2/2008 8:59:24 AM GMT

Beanie2k · 2008-09-02 08:54

Something we might want to keep in mind is a variant of Murphy's Law which states: "If N cogs are available then a software developer will always want to run N+1 processes." Ergo maybe it would be better to focus instead on modularizing the Prop2 to allow them to be interconnected via a high speed data transfer and some form of handshake/semaphore/synchronization system. This way if more cogs are needed then simply add another Prop. Perhaps I'm naive but I think this gives the advanced people the power they need while still retaining basic simplicity for us newbies.

cgracey · 2008-09-02 08:56

Sapieha said...
Hi Chip Gracey.

Have you read my post on the enhancement to the Video generator/counter?
I am only curious.

I think you were talking about having it perform serializing/deserializing. I must look for it.

Counter augmentation is going to be a 'chapter' in itself. Lots of improvement can be done there.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

cgracey · 2008-09-02 08:58

QuattroRS4 said...

...I know this is all an avenue of exploration, a 'sounding board' of sorts .. I am just wondering if all the points and suggestions outlined can be summarised, outlining the proposed inclusions/exclusions and the respective why/why not.

Yes, I need to do that. I think I'm going to have to sleep now for a while, but I will try to summarize all this tomorrow.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

Sapieha · 2008-09-02 09:01

Hi Chip Gracey.

I see you have missed my post with the VIDEO/WAV enhancement suggestion.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.

Sapieha

QuattroRS4 · 2008-09-02 09:02

Beanie2k said...
Something we might want to keep in mind is a variant of Murphy's Law which states: "If N cogs are available then a software developer will always want to run N+1 processes." Ergo maybe it would be better to focus instead on modularizing the Prop2 to allow them to be interconnected via a high speed data transfer and some form of handshake/semaphore/synchronization system. This way if more cogs are needed then simply add another Prop. Perhaps I'm naive but I think this gives the advanced people the power they need while still retaining basic simplicity for us newbies.

I think that it has been agreed that it is 8cogs/256k with inter PropII connectivity unless I have missed something - this thread is moving very quickly .. I like your Murphy's Law - N+1 ..lol

Now instead of throwing a cog at a task it will be throw a Prop at it !

Regards,
John Twomey

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
'Necessity is the mother of invention'

Those who can, do.Those who can’t, teach.

heater · 2008-09-02 10:14

Hi Phil: "You can't throw a task at a problem with the same ease that you can throw a cog at it."

I look at it the other way around: you throw a problem at a thread (or threads), and the question is whether that thread runs on its own processor or shares a processor with other threads. That was the beauty of the transputer: at the source code level (Occam) you didn't have to work so hard to get things distributed when required, or to combine many threads onto one CPU when required.

Anyway I concur.

PLEASE, PLEASE let's not sacrifice the 160 MIPS.

There will always be problems that need the raw horsepower in one place. We love to see big numbers on the spec sheet. With adequate COG to COG communication, those that really need more COGS will just have to plop down another chip. As the yanks used to say about cars, "you can't beat cubes". If any kind of multithreading is forthcoming then we are effectively throttling down a COG to get more threads in, BUT you can't throttle up having limited yourself in hardware. More MIPS seems more flexible than more COGS at this point.

Another silly car analogy: Most car engines don't have 16 cylinders, there seems to be a point round 8 or 12 where the pain of more cylinders offsets any gains.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

heater · 2008-09-02 10:29

Oh yeah, I forgot to say, I want my 8080 emulator to really fly

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

hal2000 · 2008-09-02 11:11

I prefer 8 cogs with 4K, and a pair of communication buses to add more if necessary: a Prop II, an LCD panel or RAM. A 16-bit bus seems appropriate, or even 24-bit addresses at the biggest.

8 cogs and a small FPU implemented

This is good for educational applications.
8 fast cogs are better than 16 clumsy cogs.

16 cogs: clumsy

please

A greeting from
Spain!

hal2000 · 2008-09-02 11:15

I am using two Prop Is: P1 = TV, P2 = control and calculation.
The graphics routine is very heavy ....

heater · 2008-09-02 11:30

After much calculation and contemplation I have come to the conclusion that the ideal number of COGS is ...... NINE.

Why?

1) 9 fits neatly on a 3 by 3 grid.
2) 9 is one bigger than 8 so those Murphys who need N+1 will now have it.

Now the 3 by 3 grid idea is actually serious. At least a logical 3 by 3 if not an actual physical layout. What I'd like to see is seriously high speed links between neighbouring COGS, even possibly 32 bit wide parallel, with some kind of WAITTX, WAITRX instruction or whatever. This would enable neighbouring COGS to cooperate at speed on tasks that need the horsepower without having to hang around in the HUB RAM. I understand some video code already requires two COGS.

So, to keep things nice and regular, as we like to see in our Props, each Cog would have 4 links to four neighbours.
Ah, you say, the COGS round the edge don't have four neighbours. Well, they do if you link the right side column to the left side column in wrap-around fashion, and likewise the top row wraps round to the bottom row. Bingo, nice and regular. Links, especially if serial, would optionally go off chip to neighbouring Props.

Now, realizing that that is 18 lots of link hardware, another approach is to link the COG in the middle to its eight neighbours, in a star topology. Actually this fits with the view that a Prop program is quite often one main application running on one COG with a bunch of peripherals running around it.

Of course neither of these topologies is entirely regular, you can't directly link from any COG to any other and perhaps worse you have to specify which COG runs what else they can't communicate.
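The link counts for the two layouts are easy to check with a short sketch. Cogs are labelled here with (row, column) pairs purely for illustration, and `torus_neighbours` and `torus_links` are hypothetical helper names, not anything proposed for the chip.

```python
def torus_neighbours(row, col, size=3):
    """The four neighbours of a cog on a size x size grid with wrap-around."""
    return {
        ((row - 1) % size, col),   # up, wrapping top to bottom
        ((row + 1) % size, col),   # down
        (row, (col - 1) % size),   # left, wrapping left to right
        (row, (col + 1) % size),   # right
    }

def torus_links(size=3):
    """Every distinct cog-to-cog link, counted once per pair."""
    links = set()
    for r in range(size):
        for c in range(size):
            for n in torus_neighbours(r, c, size):
                links.add(frozenset({(r, c), n}))
    return links

STAR_LINKS = 3 * 3 - 1   # centre cog linked to each of the other eight

print(len(torus_links()), STAR_LINKS)
print((1, 1) in torus_neighbours(0, 0))
```

The wrapped grid gives every cog exactly four neighbours and 18 links in total, against 8 for the star. The last line shows the irregularity mentioned above: cog (0, 0) has no direct link to cog (1, 1).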

That's it for crazy ideas today. I'll start taking my meds again...

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

Ken Peterson · 2008-09-02 13:14

Heater: Interesting idea. I do recall some talk about keeping every cog identical, so that you don't have to pay attention to which cog you launch in (I think it was also a discussion about using COGNEW vs. COGINIT). So how does one keep track of which cogs have which neighbors? Perhaps you can number the cogs with row/column pairs rather than serially. Such pairings would then be easier to keep track of. A multi-cog object might then require a "column" or "row" rather than just a cog or two.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
·"I have always wished that my computer would be as easy to use as my telephone.· My wish has come true.· I no longer know how to use my telephone."

- Bjarne Stroustrup

Sapieha · 2008-09-02 14:50

Hi heater.

From my discussion with Chip, the 160 MIPS is not being sacrificed.

I think 9 COGs is not an option, because of the binary nature of the counters' construction (more complicated).
If Chip incorporates a SERIN/OUT module, I propose that its speed could be near the COG's 160 MIPS, and if it incorporates mux possibilities it can communicate all to all.
It is Chip who has the possibility to decide on it.

With the VIDEO extensions Chip proposed in one of his posts, it is not necessary to have 2 COGs for VIDEO with very fine resolutions.
With my proposal to extend his VIDEO proposal, it is even possible to have a 2-channel 16-bit WAV generator with variable speed!

The Propeller's power is in the functionality of every COG and PIN, and it should be as good as possible.
And I proposed to have a counter mode like a WatchDog with programmed length.
That counter mode is necessary to program my proposed own timesliced multitasking protocol.

P.S. Chip's proposal to extend VIDEO is on page 8 (Posted 8/29/2008 11:29 AM (GMT +1))

My proposal to extend it to VIDEO/WAV is on page 14 (Posted 9/01/2008 3:26 PM (GMT +1))

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.

Sapieha

Post Edited (Sapieha) : 9/2/2008 4:17:59 PM GMT
