Should the next Propeller be code-compatible? - Page 6 — Parallax Forums

Should the next Propeller be code-compatible?


Comments

  • Ron Sutcliffe Posts: 420
    edited 2008-08-29 02:55
    @ Sapieha

    My thoughts were based on
    Hippy said...
    If there's a RDTXFR and WRTXFR which will magically transfer longs between two Propellers connected using a single pin at high speed that would mitigate the number of Cogs needed by making multi-chip arrays easier. Something like that would be nice to have even if it wasn't a blindingly fast link.
    Magic really is bleeding-edge stuff!

    Regards

    Ron
  • Sapieha Posts: 2,964
    edited 2008-08-29 03:01
    Hi Ron Sutcliffe.

    Have a look at this:

    http://forums.parallax.com/forums/default.aspx?f=25&p=1&m=212396

    It is one of my proposals for this. See Picture 2.
    It is not a finished design. It was only an idea to improve Prop<>Prop communication and more.
    Magic, you said!

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Nothing is impossible, there are only different degrees of difficulty.

    Sapieha
  • parsko Posts: 501
    edited 2008-08-29 03:55
    Will Prop II Spin be faster than Prop I assembly? (can you give rough numbers of the difference in speed?)

    How about packaging two 8-cog Props onto one chip? One of them could even have 32 differential ADCs, the other all 64 (a 96-pin Prop)?

    What Mike said about utilizing all the cogs for different stuff (GPS, kb/mouse, I2C, SPI, etc.) is what I had envisioned using it for. It seems like 8 cogs is choking. But, if Prop II Spin is just as fast as Prop I assembly, then most of that stuff could reside in main memory anyway, possibly in larger routines that would not otherwise fit in one Prop I assembly cog.

    BTW- As a Mech Eng, I am having a hard time understanding most of this conversation! All of you really know your stuff!!!!

    Post Edited (parsko) : 8/29/2008 4:03:16 AM GMT
  • hippy Posts: 1,981
    edited 2008-08-29 04:34
    Chip Gracey (Parallax) said...
    If each new cog is more powerful than a whole current Propeller chip, do you really need 16 of them? Would 8 not suffice? Personally, I've never used all 8, except in some demo to show what the chip could do ... Are you guys sure about 16 cogs?

    Do we need them? Depends on how the Propeller is used, and what makes it great -

    What's good about the Propeller is that to do something in parallel you just throw a Cog at it. Quite often that will be just a few lines of Spin, but it's easy and quick to do, and elegant. Obviously the more Cogs the better, and I think there are people who do run up against the 8-Cog limit.
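
    For illustration, that "few lines of Spin" really is about all it takes - a minimal sketch of the throw-a-Cog-at-it pattern on the current chip (pin number and delay are arbitrary):

        VAR
          long stack[16]                      ' workspace for the new Spin cog

        PUB Main
          cognew(Blink(16, clkfreq), @stack)  ' hand the job to its own cog...
          repeat                              ' ...and the foreground carries on

        PRI Blink(pin, delay)
          dira[pin]~~                         ' make the pin an output
          repeat
            !outa[pin]                        ' toggle it
            waitcnt(cnt + delay)              ' wait one second between toggles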

    We could all make those few line routines 'multi-task' but that's a lot of messing about compared to how easy it is when not having to.

    Maybe that's the answer ... you do the multi-tasking, write the Mk II Spin Interpreter to run multiple Spin programs and you have a 16+ virtual Cog system on whatever you choose.

    I think the true answer comes on what type of programmer it is aimed at; experienced or novice, Spin or PASM.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-08-29 04:51
    I don't want to press this point if it's doomed to the dustbin; but I've been thinking more about interleaved multithreading* and the benefits it would confer with (what I would guess to be) little additional silicon. Those who have been crying, like a voice in the wilderness, for interrupts would, at last, be satisfied, since one thread could simply sit on a WAITxxx for an "interrupting" condition, while the other kept doing its "foreground" stuff. And the ominous intonings of those (myself included) who, like the chorus in a Greek tragedy, warn that interrupts would break determinism would be stilled as well. Of course, each thread (of two), if threading were enabled by the user's program, would be operating at half speed. But with the processing speed touted for the Prop II, that would hardly matter for many apps.

    One possible gotcha for determinism might be if both threads need to access hub memory. But maybe the threads could be assigned alternate hub accesses, as well. In any event, if each hyperfast cog (of eight) could be made to act like two merely superfast ones, at the programmer's whim, a reasonable silicon budget compromise might be reached.

    -Phil

    *In interleaved multithreading (which I mistakenly called "hyperthreading" in a previous post), there are two program counters and two sets of condition bits. The threads take turns executing instructions, alternating them: ABABAB.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    'Still some PropSTICK Kit bare PCBs left!

    Post Edited (Phil Pilgrim (PhiPi)) : 8/29/2008 4:59:25 AM GMT
  • lonesock Posts: 917
    edited 2008-08-29 04:56
    Phil Pilgrim (PhiPi) said...
    I don't want to press this point if it's doomed to the dustbin; but I've been thinking more about interleaved multithreading* and the benefits it would confer with (what I would guess to be) little additional silicon. Those who have been crying, like a voice in the wilderness, for interrupts would, at last, be satisfied, since one thread could simply sit on a WAITxxx for an "interrupting" condition, while the other kept doing its "foreground" stuff. And the ominous intonings of those (myself included) who, like the chorus in a Greek tragedy, warn that interrupts would break determinism would be stilled as well. Of course, each thread (of two), if threading were enabled by the user's program, would be operating at half speed. But with the processing speed touted for the Prop II, that would hardly matter for many apps.

    One possible gotcha for determinism might be if both threads need to access hub memory. But maybe the threads could be assigned alternate hub accesses, as well. In any event, if eight hyperfast cogs could be made to act like sixteen merely superfast ones, a reasonable silicon budget compromise might be reached.

    -Phil

    *In interleaved multithreading (which I mistakenly called "hyperthreading" in a previous post), there are two program counters and two sets of condition bits. The threads take turns executing instructions, alternating them: ABABAB.
    I do like this idea, but a question: does this mean A) each cog would need 2 sets of shadow registers (thus losing 32 instead of just 16 longs), and B) each cog is then left with only 240 longs per virtual-cog?

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    lonesock
    Piranha are people too.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-08-29 05:01
    lonesock,

    No to both questions. The threads would share the cog resources as determined by the programmer. Since the threads' memory accesses would be interleaved, there would be no internal bus conflicts. The "ABAB" scheme I mentioned is a temporal one, BTW, and does not relate to memory addresses. Each thread would have its own program counter and condition bits and could execute code anywhere in the cog's memory. Two threads could even share the same code under this scheme, except where code gets modified, for example, by JMPRETs.

    -Phil

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    'Still some PropSTICK Kit bare PCBs left!
  • rokicki Posts: 1,000
    edited 2008-08-29 05:03
    Bigger and faster means being able to do more complex things; for more complex things, I believe more memory beats
    more cogs. I really envision most people moving to some sort of LMM for most things, and some limited low-level
    code for video and stuff. If we get some sort of shift-in (of almost any sort) serial I/O will be *much* easier, so
    I'm not worried about keyboard/mouse.

    Each cog will be *so* fast; I really do think 8 cogs is plenty.

    I am not sure, but I suspect with eight cogs the timing might be just a bit easier on the chip itself. And certainly the
    smaller chip means a lower price.

    The model of "throw a cog at everything" is a neat model, but I am *much* more excited about 256K RAM than I am
    about 16 cogs.

    Good serial I/O would mean reasonably good abilities to synchronize two chips (although it would be nice if it were
    easy to lock-step the separate chip oscillators/PLLs somehow).
  • Sleazy - G Posts: 79
    edited 2008-08-29 05:14
    Hey chip.
    Chip Gracey (Parallax) said...
    ANOTHER QUESTION:

    If each new cog is more powerful than a whole current Propeller chip, do you really need 16 of them? Would 8 not suffice? Personally, I've never used all 8, except in some demo to show what the chip could do. To me, 8 is quite rich. By the time we get to 16, we are hub-starved and have to resort to cache-line style hub accesses to get the bandwidth back up (well, way up).

    Are you guys sure about 16 cogs?
    Didn't you say you were quad-port pipelining the hub in Prop II? Wouldn't this make the cache-line hub accesses less of an
    exercise in starvation?

    I've currently got a protoboard using all 8 cogs, in numerical order:

    cog #    process
      0      main / OS / mgmt / sys monitor
      1      Nintendo ADVANTAGE joystick (you remember), 19200 baud serial
      2      Matrix 4x4 keypad
      3      2x16 backlit serial LCD, 4800 baud
      4      SERIAL OUT data to 4800 baud RC transmitter
      5      high-speed servo logic algorithms
      6      local Servo A pulse out
      7      local Servo B pulse out

    Most of the cog methods which input data dump it to the hub, to be accessed later by any other cog that requests it from
    the hub. If I try to put 2 methods on one cog, I end up losing speed, since the hub access frame (with respect to pending
    hub data updates) keeps growing.

    Example: let's say that it takes roughly 10 clocks for the keypad method in COG #2 to update a hub byte array which holds
    the current keypad button status. Also let's say that it takes the joystick method in COG #1 roughly 10 clocks to update a
    hub byte array which holds the status of the joystick/Nintendo buttons. If I run each method in separate cogs as they are
    now, I can be assured that the hub will receive new data from COG #2 and COG #1 EVERY 16 clocks. On the other hand, if I
    were to combine the joystick and keypad methods in one cog, and assuming 20 clocks between LOCAL variable updates (10
    clocks for joy, 10 for kpd), we are not assured that every hub access frame will retrieve a new value from those joystick
    and keypad methods. For example, in 16 clocks we could have an updated joystick value but not an updated keypad value,
    or vice versa, since either the keypad method or the joystick method would be incomplete within one hub access cycle of
    16 clocks. So we are forced to wait longer for hub updates when running code in one cog which could be split between
    two.

    I know 10 clocks doesn't sound like a lot for an update cycle for a method, but take the above example and imagine 100
    clocks required. In two cogs it would guarantee that every 7 hub access frames we would return updated values from the
    keypad and joystick methods; conversely, running both keypad and joystick methods in the same cog would only guarantee
    that either the keypad or the joystick variables were updated, BUT NOT BOTH, in 7 hub cycles, and getting the correct
    phase would cost us an extra hub frame. This will eventually cause an aliasing problem when outputting these values with
    deterministic timing from the hub to another COG, which could be another method that processes the values.

    This makes average input update rates slower when not parallel processed. I like the 16 cog idea.
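
    (Working those numbers through: a 100-clock method fits in 7 hub windows, since 7 x 16 = 112 >= 100, so two dedicated
    cogs each deliver fresh data every 7 windows; a combined 200-clock loop needs 13 windows, 13 x 16 = 208 >= 200, before
    both values are fresh again.)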

    But I have a better idea. Why not use fractal cogs, where you could have 8 "super" cogs, each consisting of 8 regular
    cogs (64 regular cogs total, or 8 groups of 8 cogs). The "super" cogs would have their own "super" hub, and the regular
    cogs would have their regular hubs (8 of them total). This would mean 1 "super" hub and 8 "regular" hubs. You could
    seemingly do this with 8 Propellers grouped together with a mutually exclusive "super" hub connecting them all. Do you
    see the fractality? Propellers into Cogs?

    This seems to be a good compromise if it can be done, since it would allow all 64 pins to have their own dedicated cog.
    That way you could run exclusive serial methods from each pin with parallel processing.

  • heater Posts: 3,370
    edited 2008-08-29 05:46
    Chip:

    Perhaps I should not ask but: If you were really starting from a "clean slate" with the silicon resources available for Prop II and given the "invention" of the LMM technique, would Spin even be byte code based or would it be LMM based ?

    I'm starting to warm to the idea of only 8 COGs. After all, whichever way you look at it, it allows TWICE the HUB access rate for each cog and gives us a cheap device.

    BUT I want to see that COG to COG intra- and inter- Prop serial communications squeezed into any remaining space on the 6 by 6 mm chip.

    Anyone who really needs 16 COGs can then throw another chip at their problem and transparently move code over to it. Like the good old Transputer. More likely they really only need more than 8 threads, which could be supported by a threaded LMM machine.

    It has been said that "a camel is a horse designed by a committee"; I hope this committee is not pushing for too many humps and bumps in the Prop design.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    For me, the past is not over yet.
  • heater Posts: 3,370
    edited 2008-08-29 05:58
    Sleazy-G: Are you seriously suggesting that human interface devices like a keypad, joystick, and LCD really need a 32-bit 80 MHz CPU EACH? And just for the purpose of ensuring samples are arriving in phase somehow?

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    For me, the past is not over yet.
  • Paul Baker Posts: 6,351
    edited 2008-08-29 06:02
    The following is just personal input into the discussion: When everyone talks about the need for interrupts, that's not really what they are asking for; they're talking about the tool, not the objective. Everyone asking for this actually wants multitasking, and interrupts are how most implement it. I don't know how hard it would be to implement, but what if some form of task switching were allowed, minimal count, say 1-4 tasks? It would use the same mechanism as JMPRET (and therefore needs a location to store its return value), but would automatically switch every X cycles, and the C and Z flags would automatically be stored when switched out of and restored upon startup.
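
    For context, the cooperative version of this already exists on the Prop I: JMPRET saves the current task's resume address
    and jumps through the other task's saved address. A minimal sketch of that pattern (labels and structure are illustrative,
    not from any particular driver) - the hardware switcher described above would essentially do this automatically:

        DAT           org     0

        task_a        ' ...a short slice of task A's work goes here...
                      jmpret  a_pc, b_pc        ' save A's resume point, run task B
                      jmp     #task_a           ' loop task A

        task_b        ' ...a short slice of task B's work goes here...
                      jmpret  b_pc, a_pc        ' save B's resume point, run task A
                      jmp     #task_b           ' loop task B

        a_pc          long    task_a            ' coroutine "program counters"
        b_pc          long    task_b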

    This way everyone sorta gets what they want. Keeping 8 cogs keeps hub access quick and removes the need for getting into nasty caching (I speak from personal experience; I examined over 100 patent applications concerning caching in multi-processing environments while at the USPTO, and caching is something to be avoided like the plague). Those people who don't need the raw speed can bifurcate (quadfurcate?) cogs into shared processes, so it looks like they have 32 processors, each still having twice the processing power of the current cog.

    How difficult would that be Chip? And what are the hidden gotchas I'm missing?

    <another thought, maybe one of the counters can be used as the trigger>

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Paul Baker
    Propeller Applications Engineer

    Parallax, Inc.

    Post Edited (Paul Baker (Parallax)) : 8/29/2008 6:22:49 AM GMT
  • Ariba Posts: 2,690
    edited 2008-08-29 06:07
    Chip Gracey (Parallax) said...

    The 16-cog die would be 64 square millimeters (8x8), at a cost of about $.10/mm, or $6.40 before packaging and yield loss. The current die is 52 square millimeters at a cost of about $.05/mm, or $2.60 before packaging and yield loss. So, it would be a lot more money....

    So, the new silicon process is 2 times more expensive than the old one. This also means the 8-cog / 128 KB Prop2 costs much more than Prop1 (36 square millimeters = $3.60, about 1.4 times more, which works out to an end price of around $18 for a single unit).

    In this case I have to vote for a 4-cog version with 64 KB hub RAM!
    This only needs 4 mm x 4 mm = 16 square mm = $1.60, and an end price of $7.90 per unit, or $4.64 at 5000 pcs.

    Such a chip is still 4 times more powerful than Prop1 and has a hub access rate of 40 MHz! That is: LMM code could be executed 2 times faster than native assembler code on Prop 1.
    If 32 I/O pins are enough, then it could still come in a 40-pin DIP.
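
    (A rough tally of the per-square-millimeter figures Chip quoted, assuming die cost simply scales with area and ignoring
    packaging and yield loss:

        16-cog Prop II: 8 x 8 mm = 64 mm² x $0.10 = $6.40 per die
         8-cog Prop II: 6 x 6 mm = 36 mm² x $0.10 = $3.60 per die  (about 1.4x the current $2.60)
         4-cog Prop II: 4 x 4 mm = 16 mm² x $0.10 = $1.60 per die)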

    OK, the attached picture is a joke, and I see that it would be very hard to explain why the Prop2 has a lower cog count than Prop1, but I seriously would like to buy such a chip, perhaps as a third member of the Propeller family. And I am sure it would be the best-selling chip of the three.

    Andy
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-08-29 06:07
    One modification to my threading proposal that might come in handy for things like JMPRETs:

    Reserve an area at the very beginning of program memory consisting of two 16-long blocks that get overlain with each other, each one accessible only by its associated thread. When the cog is loaded, both sets would get initialized with the same data from the hub, and execution would begin there as a single thread. The first instruction could be a jump out of the overlay area or the beginning of initialization code that can afford to get overwritten later. If/When the "split" operation takes place, each thread could write data in this area that the other thread would not see. This makes it useful for return addresses that can be used anywhere else in the code area via indirect JMPs. Putting the return addresses there would allow two threads to share the same subroutines without conflict.

    I was also thinking about this in terms of LMM programs. I know the instruction set is full, but if it were somehow possible to add an EXEC instruction, it would come in handy. An EXEC instruction simply tells the processor to execute the single instruction at the EXEC's destination address, then resume with the instruction after the EXEC (assuming the EXECed instruction wasn't a JMP). That way, instructions emulated by the LMM kernel could be read into the overlay area and executed from there. This would allow two LMM kernel threads to share the same code.

    -Phil

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    'Still some PropSTICK Kit bare PCBs left!
  • Nick McClick Posts: 1,003
    edited 2008-08-29 06:30
    Ask a gearhead how much horsepower is enough...

    Give me a CHEAP chip, 16 pin DIP, 4 cores. If a prop II runs 160MHz & 1 instruction / clock, a 4 core Prop II is already 4 times faster than the prop I!
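
    (That arithmetic, spelled out: the Prop I delivers roughly 80 MHz / 4 clocks = 20 MIPS per cog, 160 MIPS across 8 cogs; a
    4-core Prop II at 160 MHz and 1 instruction per clock would be 160 MIPS per cog, 640 MIPS total - 4 times the Prop I.)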

    Binary compatibility is over-rated. I'll be fine as long as you keep the overall spin & assembly syntax (and don't stop selling prop I's).

    There are already companies that make expensive, bloated platforms. You've given us reasonably priced high-performance, the next step is cheap high-performance.

    I guess I've thought of parallax as an atypical IC maker & I find the current prop II plans to be very typical; more MIPS, more pins, more complexity, more MHz, greater cost.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Concentrate on understanding the problem, not applying the tool

    Post Edited (Nick McClick) : 8/29/2008 6:46:37 AM GMT
  • ErNa Posts: 1,752
    edited 2008-08-29 06:51
    Now I understand why Prop II is not finished yet. Maybe we should learn from the Transputer. The Propeller lacks a standard for cog intercommunication. We still think in terms of writing to the hub memory, or to the global memory. But such a memory doesn't really exist. What really exists is a communication mechanism that transfers information between different processes, and storage to hold information. These two things are completely different; the hub mechanism is only one embodiment of these absolutely different tasks. You see this in the fact that data can be stored locally or globally, and communication can be realised by memory cells or by I/O. Even cogs in one chip can communicate very fast via I/O pins.

    Code compatibility is not the main aspect. The power of the Propeller is parallel processing. The granularity of the programs is determined by the number of physical processors. With OCCAM you program as if you had an infinite number of processors, and only after you finish programming do you determine the necessary computing power and manually distribute the load across an adequate number of processors. In my concept I use one cog to establish communication between Propellers.
  • cgracey Posts: 14,256
    edited 2008-08-29 06:54
    You guys have a lot of good ideas.

    Phil, it would take only a little hardware to provide, say, 4 different PC/Z/C sets. Threads could be turned on and off like this:

        tacknew  D      'begin new thread from address D (D is used as a constant)

        tackid   D      'get thread id into D (can differentiate threads for 'return' purposes)

        tackend         'end this thread (ends, returns time to other threads)

    Any of the 4 threads could be active and they'd execute in round-robin fashion, just like the hub, but with empty slots being skipped. Initially, only thread 0 would be running. This is really simple, but making overlays is a pain due to increased physical memory. I'm not picturing a compelling reason why you'd need to share routines (RET issue), either. In this scheme, you'd just have to suffer if one thread did a WAITCNT (in other words, don't use instructions like that). RDxxxx/WRxxxx would cause some delays. Threads would have to be programmed so that they wouldn't try to multi-use things like the background math state-machines, the REPeat instruction, the indirect registers, the hub memory pointers, etc. Overall, everything would execute pretty quickly, though, with minimal delays and no programmatic context switching (as with JMPRET). Maybe it would be so hobbled by restrictions that it would be borderline useless. It would take some silicon to replicate the indirect registers and pointers. I actually think JMPRET does a pretty decent job, especially JMPRETD which wouldn't flush the pipe, so that two trailing instructions would execute, bringing the context-switch down to just one clock.
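
    Reading that proposal literally, usage might look something like this (purely a sketch of the hypothetical mnemonics
    described above - nothing like this exists yet):

        start    tacknew  worker          'spawn a second thread starting at "worker"
        mainloop '...thread 0 carries on with its own work here...
                 jmp      #mainloop

        worker   tackid   whoami          'which thread am I? (for 'return' bookkeeping)
                 '...do the second thread's job...
                 tackend                  'done - hand the time slots back

        whoami   long     0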

    Would this tacknew/tackend be much of a solution? It IS simple.


    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.

    Post Edited (Chip Gracey (Parallax)) : 8/29/2008 7:23:48 AM GMT
  • ErNa Posts: 1,752
    edited 2008-08-29 07:03
    @Paul Baker: You are right. People often don't think about what they want, but about how they can do it. A man who owns a lathe builds a world of round things. When I started with my Z80 I had a Fortran compiler, and Fortran was good for batch programming. With a few lines of assembler I could interface to the BIOS console I/O and status calls, and from that moment I could write interactive software. And my multitasking was just round-robin scheduling of small modules in an infinite loop; every module first checked an active flag and returned immediately if it was false. The modules had to be small so the loop ran in real time, and priority could be set by simply calling the high-priority modules more often within the loop. That's all, and therefore a multitasking Spin program is no problem: just a loop and semaphores. But tuning is an extra task to do manually that could better be done by a compiler, if the language has the appropriate elements.
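
    A minimal Spin-flavoured sketch of that round-robin pattern (flag names and task bodies are just placeholders):

        VAR
          byte flagA, flagB            ' "active" flags, set elsewhere

        PUB MainLoop
          repeat                       ' the infinite round-robin loop
            HighPri                    ' high-priority module runs every pass
            if flagA
              TaskA
            HighPri                    ' ...and again, to raise its service rate
            if flagB
              TaskB

        PRI HighPri
          result++                     ' placeholder for short, time-critical work

        PRI TaskA
          flagA~                       ' placeholder: do a little work, clear the flag

        PRI TaskB
          flagB~                       ' placeholder: do a little work, clear the flag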
  • Timmoore Posts: 1,031
    edited 2008-08-29 07:08
    The other big gotcha with multithreading is calls: since they use a single memory location for the return address, they either have to be avoided, a stack has to be added, etc. It has the same limitations as JMPRET.

    One other comment about the earlier discussion on deserializers: look at allowing the serializer/deserializer to be combined; for example, trying to do an SPI bus using them needs one clock for in/out.
  • Sapieha Posts: 2,964
    edited 2008-08-29 07:26
    Hi All

    The Propeller's power is simplicity.
    Do not construct a monster.
    All new additions must have the same simplicity.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Nothing is impossible, there are only different degrees of difficulty.
    For every stupid question there is at least one intelligent answer.
    Don't guess - ask instead.
    If you don't ask you won't know.
    If you're gonna construct something, make it as simple as possible yet as versatile as possible.


    Sapieha

    Post Edited (Sapieha) : 9/25/2008 12:26:31 PM GMT
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-08-29 07:29
    Chip,

    Might the WAITxxx instructions work if interleaving were done at the processor clock level, rather than the instruction level? In other words, each thread would get one cycle at a time. If one was in the middle of a WAITxxx and its condition was not met, the state of that thread would not change, and the next thread would get the next cycle. The only thing that changes with the WAITs is that their time granularity would be proportional to the number of concurrent threads. Under this scheme, you'd have to keep track, not only of each thread's PC and flags, but also the state of its pipeline. But at least a WAITxxx wouldn't bring the whole cog to its knees.

    I can see your point about an overlain section of memory and the pain it would be to implement. Absent a way to share subroutines, I still believe threading would be useful — especially where the Prop II's blistering speed is overkill for a single process.

    -Phil

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    'Still some PropSTICK Kit bare PCBs left!
  • Sleazy - G Posts: 79
    edited 2008-08-29 07:39
    heater said...
    Sleazy-G: Are you seriously suggesting that human interface devices like keypad, joy stick, and LCD really need a 32 bit 80MHz CPU EACH? And just for the purpose of ensuring samples are arriving in phase some how.

    Listen, I think if you had experience with these applications you would know that the overhead involved in processing their signals makes any time spared quite treasured. Optimization of system response time is the main objective here. It seems like 80 MHz is a lot if you are writing ASM, but the simplest of ASM operations still takes 4 clock cycles, effectively knocking the clock down to 20 MHz if you were thinking "one clock, one operation", unless you interleave 4 cogs, which then leaves you with your resources halved at 80 MHz and would not be suitable with multiple peripherals. If you're writing Spin that interprets into a similar ASM operation, I remember it taking something like 200 clocks, as opposed to 4 clocks when using Spin to call an ASM DAT block.

    Remember the hub window is 16 clocks. In the instance of my protoboard setup, the cogs must relay the status of the peripherals to the hub, so that other cogs may do further operations with the values.

    SO LOOKING AT IT REALISTICALLY, here is an estimated BEST CASE SCENARIO time for ASM using hub RAM for COG-to-COG communication in this specific application.

    So there's a read op on the pins (4 clocks); the NES joystick is naturally an 8-bit serial input to be decoded bitwise (minimum 4 x 8 clocks) into a byte string (min 4 x 8 clocks); then it is combined with the keypad data, which is a byte string (min 4 x 16 clocks); then a write to the hub followed by an immediate read from the hub (minimum 16-clock hub frame x 7 longs of data); then calculation of the pulse-width change of the servos compared to the joy & kpd status (for your sake we'll call that 0 clocks for now), which is a byte string of 25 elements to be compared to tables (minimum 4 x 25 clocks); then output of the signal (4 clocks).

    Do the math: it's a 348-clock interval between updates of the servo with respect to the inputs, assuming NO processing in between. So think of that 80 MHz clock as more like a 230 kHz clock before you even start thinking about data processing overhead. The whole world is not soft.
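
    (Tallying the estimates above: 4 + 32 + 32 + 64 + 112 + 0 + 100 + 4 = 348 clocks, and 80,000,000 / 348 is roughly 230 kHz,
    which is where that figure comes from.)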

    The more cogs you have dedicated and paralleled, the smaller your loops can be, minimizing processed throughput times from dedicated cog to dedicated cog. This also ensures your timing is deterministic. It is also very compatible with the nature of most library objects. This structure allows cogs to run at their ultimate performance locally, but at the sacrifice of power consumption.

    And with respect to the "arrive in phase somehow": you need to update hub RAM through the hub window at intervals quantized by the dedicated COG loop speeds to ensure synchronicity. Otherwise, comparative logic operations between the joystick values and keypad values could prove unreliable, depending on where they reside.


  • cgracey Posts: 14,256
    edited 2008-08-29 07:40
    Phil Pilgrim (PhiPi) said...
    Chip,

    Might the WAITxxx instructions work if interleaving were done at the processor clock level, rather than the instruction level? In other words, each thread would get one cycle at a time. If one was in the middle of a WAITxxx and its condition was not met, the state of that thread would not change, and the next thread would get the next cycle. The only thing that changes with the WAITs is that their time granularity would be proportional to the number of concurrent threads. Under this scheme, you'd have to keep track, not only of each thread's PC and flags, but also the state of its pipeline. But at least a WAITxxx wouldn't bring the whole cog to its knees.

    I can see your point about an overlain section of memory and the pain it would be to implement. Absent a way to share subroutines, I still believe threading would be useful — especially where the Prop II's blistering speed is overkill for a single process.

    -Phil

    Phil,

    I think it would create havoc within the pipe from having some instructions advance, while others stalled. Imagine someone magically staying still in the middle of a fully-loaded moving escalator.

    Another thing to think about: imagine we did allow a thread to do a stalled waitcnt, while at the same time another thread did a hub instruction, which caused the time window to pass for the waitcnt. We'd have to replicate all kinds of circuitry to keep things correct. I think it's gotta be first-come-first-serve, or there's gonna be trouble.

    A simple context switcher would not take much hardware, but be rather elegant. I modified my post above to add a 'tackid D' instruction which could be used to differentiate threads for 'return'-type purposes.



    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-08-29 07:50
    Okay, I think I know a way around the JMPRET problem that doesn't require overlays. (Tim: JMP, JMPRET, CALL, and RET are all JMPRETs, BTW.) If you do an indirect JMPRET, only bits (8..0) of the long at the source address are currently used. Bit 9 could be used as a flag to tell the processor to add the thread number (0, 1, 2, or 3) to the destination address where the return address gets stored:

            JMPRET  Sub_Ret,SubPtr
    
    Sub     ...
    Sub_Ret JMP     0-0
            JMP     0-0
            JMP     0-0
            JMP     0-0
    
    SubPtr  LONG    $200 + Sub
    
    
    


    The number of JMP 0-0 instructions could be varied, depending on the maximum number of active threads.

    -Phil

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    'Still some PropSTICK Kit bare PCBs left!
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-08-29 08:04
    Chip Gracey said...
    I think it would create havoc within the pipe from having some instructions advance, while others stalled. Imagine someone magically staying still in the middle of a fully-loaded moving escalator.

    I guess I was thinking in terms of multiple escalators (pipelines). If one stalled, it wouldn't affect the others. Each would have its own state counter, instruction register, source register, and destination register and receive a "tick" every nth processor clock. This is proposed out of ignorance about how much silicon that entails, though, or what else would have to be replicated to make it work.

    As an alternative, perhaps a stalled WAIT could simply "give up its turn", acting as a JMP to its own location, when multiple threads are involved, in order to keep things moving. This would increase the granularity to n times the number of clocks per instruction, but that might not matter for some apps. The user would simply have to be aware of the limitations.

    My gut tells me that, without solving the WAITxxx issue, those who want to emulate interrupts will be left a bit empty. Not that I agree with them, necessarily; but it would be nice to scratch that itch if it's easy enough to pull off.

    -Phil

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    'Still some PropSTICK Kit bare PCBs left!

    Post Edited (Phil Pilgrim (PhiPi)) : 8/29/2008 8:17:05 AM GMT
  • Timmoore Posts: 1,031
    edited 2008-08-29 08:11
    @Phil, I was aware; I mentioned CALL specifically because that's the common usage that will have problems. The common use of JMPRET for coroutines wouldn't, because its usage is to emulate what you are trying to do.

    I went through the 4-port serial driver to see if this would be useful. The 4-port limit was hit for two main reasons: I wanted to run the ports at 115K baud and 4 is the max before it couldn't keep up at 115K (this limitation will go away because of the extra speed of the cog), and the cog memory runs out at 4 ports (there are ~8 longs free in the cog).
    Staying at 4 ports, jmpret and threading are the same. With the extra speed I could support more than 4 ports if I get over the memory problem, though at that point I can't use threading. With the extra speed I can remove a lot of the optimizations and still meet 115K, and add extra code to allow one copy of the code to handle 4 ports by having it calculate which data structures to use on the fly. Doing that, I think it could handle 6 ports before it runs out of cog memory again.
    So the 4-port driver on Prop II has two options: stay at 4 ports and run at higher baud rates (either jmpret or threading would work), or handle up to 6 ports (only jmpret would work). If I stayed at 4 ports but used threading, the max baud rate would drop, at a saving of ~120 longs of memory.
    I haven't looked at the combo keyboard/mouse driver with the same question in mind - that one is more complex because it uses calls quite a lot, and it is also close to running out of cog memory.
  • cgracey Posts: 14,256
    edited 2008-08-29 08:12
    Phil Pilgrim (PhiPi) said...
    Okay, I think I know a way around the JMPRET problem that doesn't require overlays. (Tim: JMP, JMPRET, CALL, and RET are all JMPRETs, BTW.) If you do an indirect JMPRET, only bits (8..0) of the long at the source address are currently used. Bit 9 could be used as a flag to tell the processor to add the thread number (0, 1, 2, or 3) to the destination address where the return address gets stored:

            JMPRET  Sub_Ret,SubPtr
    
    Sub     ...
    Sub_Ret JMP     0-0
            JMP     0-0
            JMP     0-0
            JMP     0-0
    
    SubPtr  LONG    $200 + Sub
    
    
    


    The number of JMP 0-0 instructions could be varied, depending on the maximum number of active threads.

    -Phil

    Phil,

    That's a neat idea! There's still some mechanism needed to pick which return to execute - unless maybe it set bit 9 and put the thread number in the bits just above; then it would know to NOP that instruction until it got a match.

    I realized a few minutes ago that there are some complexities about doing even a simple switcher in hardware - the pipe is filled with who-knows-what, so you'd have to have the thread number propagate through the pipe so that same-thread instructions can be canceled during a jump. This would preclude the use of JMPD/CALLD/RETD/JMPRETD during threaded execution.

    In case you didn't read elsewhere, we've got a new JMPRETD instruction which DOESN'T flush the pipe on a branch, leaving the two trailing instructions to execute. Here's the difference between JMP and JMPD (inst = any non-branching instruction):


              inst1                    'normal jump, occurs in-situ (3 clocks)
              inst2
              jmp     somewhere


              jmpd    somewhere        'delayed jump, occurs 2 instructions later (1 clock)
              inst1
              inst2


              jmpd    execinst         'here is your EXEC operation
              jmpd    #$+1
              inst
              <execinst would execute in this timeslot>


    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • cgracey Posts: 14,256
    edited 2008-08-29 08:21
    Phil Pilgrim (PhiPi) said...
    ...As an alternative, perhaps a stalled WAIT could simply "give up its turn", acting as a JMP to its own location, when multiple threads are involved, in order to keep things moving. This would increase the granularity to n times the number of clocks per instruction, but that might not matter for some apps. The user would simply have to be aware of the limitations.

    Here's the gotcha: it's already got a potential instruction of its own thread coming down the pipe. That could be canceled to perform the 'jump', costing at least three clocks to get back in the saddle, in which time the hub might have given the anticipated acknowledge the thread was waiting for. Then, we'd have missed the bus.

    I think using JMPRETD is a pretty simple mechanism to employ for multi-threading, and you could have as many threads as you want. It would just cost an instruction and a clock cycle at each switch point.

    Well, I take that back. Hardware threading could work okay, but you couldn't use JMPRETD's because your own two instructions may not be in the pipe. You'd have to stick to JMPRET's only. There'd be a few rules you'd have to follow. It would save the expense of·JMPRETD context-switching instructions all over the place, along with the clock cycles needed to execute them.

    The thing is, when you do multi-threading, however you do it, NO thread has any determinism, due to the others' behavior. It's kind of a mess. I wish I had some compelling example of how it could save the day. I mean, it's neat, but it kills determinism and creates jitter in all the threads as soon as one executes even a RDxxxx/WRxxxx, which SOMEONE has to do!

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.

    Post Edited (Chip Gracey (Parallax)) : 8/29/2008 8:39:20 AM GMT
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-08-29 08:27
    Chip Gracey said...
    There's still some mechanism needed to pick which return to execute
    Oops!

    I'll have to think about that in the morning, along with the JMPD instructions. I'm still in sleep deprivation from my bout with a computer crash last night and can no longer keep my eyes open.

    More later! :)

    -Phil

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    'Still some PropSTICK Kit bare PCBs left!
  • QuattroRS4 Posts: 916
    edited 2008-08-29 08:59
    Assuming an 8-cog / 256K Prop is agreed on, with functionality for inter-/intra-Prop cog usage: how exactly would a proposed multi-Prop2 setup work? Assuming the first Prop II is a master and the second is a slave (purely a cog donor), the scenario is essentially as follows - one EEPROM connected to the master Prop, 1 master Prop II, 1 or more cog-donor Prop IIs, 1 clock source.

    How does such a system boot with the 'slave' or 'slaves' "knowing" that they are essentially cog donors, without having to be programmed? How is the clock synchronised successfully? How are the slave I/Os addressed?


    Regards,
    John Twomey

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    'Necessity is the mother of invention'

    Those who can, do. Those who can't, teach.

    Post Edited (QuattroRS4) : 8/29/2008 9:14:40 AM GMT