
Cog Speed

Humanoido Posts: 5,770
edited 2010-09-03 20:24 in Propeller 1
We know theoretically the prop runs at 80 mHz and does 160 MIPS. The speed is often divided up and quoted as 20 MIPS for each of the eight cogs.

However, based on that theoretical speed, if a program is run in just one cog, does it still run at the full 160 MIPS?

Exactly what are we saying about the same small program running in eight cogs?

When does 20 MIPs apply?

Thank you in advance.
Humanoido

Comments

  • Heater. Posts: 21,230
    edited 2010-09-03 01:05
    No.

    The instructions in one COG are executed at one quarter of the clock speed, except for those few instructions that require more than 4 clocks to execute.

    So one COG at 80 MHz is 20 MIPS.

    If you can run that same small program in 8 COGs, then you have 160 MIPS.

    This is all academic for larger programs and larger data sets where a lot of HUB access is involved and/or LMM/interpreted code is used. Then we are down to 5 MIPS per COG (LMM) or a tenth of that (interpreted).
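
    For illustration, here is that arithmetic as a minimal Spin sketch (the 5 MHz crystal and x16 PLL are the usual board setup; the constant names and the trivial method are made up, and the figures are theoretical peaks, not measurements):

      CON
        _clkmode  = xtal1 + pll16x          ' crystal + 16x PLL
        _xinfreq  = 5_000_000               ' the usual 5 MHz crystal -> 80 MHz system clock

        CLK_HZ    = 80_000_000
        COG_MIPS  = CLK_HZ / 4_000_000      ' most PASM instructions take 4 clocks -> 20 MIPS per cog
        CHIP_MIPS = COG_MIPS * 8            ' 160 MIPS, but only if all 8 cogs are doing useful work

      PUB PeakMips : mips                   ' an object needs at least one PUB to compile
        mips := CHIP_MIPS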
  • Ale Posts: 2,363
    edited 2010-09-03 01:12
    Humanoido, you may be thinking of "that other processor" where one core can run between 1 and 8 threads. Up to 4 threads you get 100 MIPS per thread; between 5 and 8 you get 80, 66.7, 57 or 50 MIPS per thread respectively (@400 MHz; 500 MHz versions exist).
    That is the case because there is only one pipelined core.

    The Propeller has 8 independent cores and each performs up to 1 instruction every 4 cycles, which at 80 MHz equals 20 MIPS. This can be "improved" to 26 MIPS easily with a 6.5 MHz crystal :).
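
    In CON-block terms that is just a different crystal value, something like this (running the part beyond its 80 MHz rating, hence the smiley):

      _clkmode = xtal1 + pll16x             ' 6.5 MHz x 16 PLL = 104 MHz system clock
      _xinfreq = 6_500_000                  ' 104 MHz / 4 clocks per instruction = 26 MIPS per cog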
  • Humanoido Posts: 5,770
    edited 2010-09-03 01:42
    Heater & Ale,

    For illustrative purposes, the prop is overclocked with a 6.25 mHz crystal. 125 tiny spin programs load into a cog.

    A cog loader puts 125 programs in each cog and runs all 1,000 programs by multitasking.

    What can be said about the function of each program (and each cog)?

    Thank you very much for your replies.

    Humanoido
  • Heater. Posts: 21,230
    edited 2010-09-03 02:41
    I really don't get the question.

    Let's be clear here:
    125 tiny spin programs load into a cog.

    If this is 125 fragments of Spin code being run on a COG, the only "program loaded into the COG" is the Spin interpreter.
    A cog loader puts 125 programs in each cog...

    Again, if we are talking Spin, the cog loader, by which you mean COGINIT, is only loading one program, the Spin interpreter.
    ...and runs all 1,000 programs by multitasking.

    Again, the cog loader is not doing this; presumably you mean there is some "scheduler" Spin method that is calling 125 other Spin methods, as we have discussed elsewhere.
    What can be said about the function of each program (and each cog)?

    How can we know? You have conceived and/or written them and we have not seen them.
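
    To make that concrete, a scaled-down sketch of the arrangement in Spin (the method names, the four tasks and the stack size are all invented for illustration; the one thing each cognew really loads into the new cog is the Spin interpreter, which then fetches the method's byte codes from HUB):

      VAR
        long  stack[32]                     ' workspace for the Spin interpreter in the new cog
        long  counter[4]

      PUB Main
        cognew(Scheduler, @stack)           ' "loading a program into a cog" = loading the interpreter

      PRI Scheduler | i
        repeat                              ' crude round-robin: each "thread" is just a method call
          repeat i from 0 to 3
            Task(i)

      PRI Task(n)
        counter[n]++                        ' stand-in for one tiny task's work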
  • Humanoido Posts: 5,770
    edited 2010-09-03 06:29
    Heater: I should say "thread" and not "computer program." I think it would be one "program" loaded into 8 cogs. I was thinking about the code we both saw before and asking about the time slices if the number of instances grew to a theoretical 1,000, spread across 8 cogs.
  • Heater. Posts: 21,230
    edited 2010-09-03 06:39
    I guessed you meant "thread" as in the Spin tasks we discussed elsewhere.

    Assuming all those "threads" are the same code, we only have 32K HUB RAM so each of your 1000 instances only has 32 bytes of work space to itself (max). Not much to work on.

    You have 8 of those Spin "schedulers" running, one per COG. Each is running 125 little task methods. So each task is running at 1/125th of the speed it would if it were the main Spin program, minus the overheads of the scheduler itself.

    This is all very unrealistic, isn't it?
  • Humanoido Posts: 5,770
    edited 2010-09-03 06:44
    Maybe there's a simple general expression that can be applied to Spin programs in regard to timing, or must the timing of each program statement be added up to determine it? I'm sure there's a point where many threads will slow the response time to almost nothing, and a point where they will be at peak MIPS.
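
    There isn't one neat formula for Spin timing, but any stretch of Spin can be measured directly with the system counter. A minimal sketch (cnt and clkfreq are built in; DoSomething and its workload are placeholders, and the result includes a little overhead from the measurement itself):

      PUB TimeIt : microseconds | t0
        t0 := cnt                           ' snapshot the free-running system counter
        DoSomething                         ' the code to be timed goes here
        microseconds := (cnt - t0) / (clkfreq / 1_000_000)

      PRI DoSomething | i, x
        repeat i from 1 to 100              ' placeholder workload
          x += i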
  • Leon Posts: 7,620
    edited 2010-09-03 06:59
    Obviously, a single thread will run fastest. You need a processor with hardware threads to maintain maximum speed for all threads.
  • Humanoido Posts: 5,770
    edited 2010-09-03 07:04
    Heater. wrote: »
    Assuming all those "threads" are the same code, we only have 32K HUB RAM so each of your 1000 instances only has 32 bytes of work space to itself (max). Not much to work on.

    You have 8 of those Spin "schedulers" running, one per COG. Each is running 125 little task methods. So each task is running at 1/125th of the speed it would if it were the main Spin program, minus the overheads of the scheduler itself.

    This is all very unrealistic, isn't it?
    So what happened to the 2K cog ram? Why is it unrealistic? If the main Spin program is doing its thing at a (theoretical) 200 MIPS, then 1/125th of the speed is a healthy 1.6 MIPS. Even if the Spin interpreter is 4x slower, it's still 0.4 MIPS. That would be 400,000 IPS, still 100 times faster than a BASIC Stamp at 4,000 IPS.

    Yes, the loader has some overhead. In the future, maybe it can be overwritten to recycle the memory.
  • mpark Posts: 1,305
    edited 2010-09-03 07:11
    Humanoido wrote: »
    We know theoretically the prop runs at 80 mHz ...

    "mHz" = millihertz
    "MHz" = megahertz <-- this is the one you want here.

    /pet peeve
  • Humanoido Posts: 5,770
    edited 2010-09-03 07:13
    Leon wrote: »
    Obviously, a single thread will run fastest. You need a processor with hardware threads to maintain maximum speed for all threads.
    Leon, indeed, it's much more common for large chip companies to run their designed threads in hardware for the speed gains, rather than in software with its speed decrements. So I think you could say this posted thread is all about determining the speed decrement in a cog with multiple threaded instances.
  • Humanoido Posts: 5,770
    edited 2010-09-03 07:16
    mpark wrote: »
    "mHz" = millihertz
    "MHz" = megahertz <-- this is the one you want here.
    /pet peeve
    Oops... you're right and I knew that... thanks for catching that one... that may be the one notation that's a little confusing. You cannot do that with kHz.
  • Heater. Posts: 21,230
    edited 2010-09-03 07:37
    Humanoido:
    So what happened to the 2K cog ram?

    In this discussion we have 8 COGs running 8 Spin programs, each running 125 instances of something unspecified. So those 8 COGs are all running a copy of the Spin interpreter, which fills the entire COG. That's where the 2K COG RAMs have gone. Programs written in Spin do not have any access to COG space (well, except for the special function registers via Spin's built-in functions).
    Why is it unrealistic?

    As I said, each task only has 32 bytes of RAM to play in and each one is running at less than 1/125th the speed of a normal Spin program. Is there any realistic use for such a setup?
    If the main Spin program is doing its thing at a (theoretical) 200 MIPS

    What?!! Where did that 200 MIPS come from?

    We already determined that small PASM programs running with all data in COG can achieve 20 MIPS for an 80 MHz Prop.

    Extend that so they make heavy use of HUB RAM data and we start to drop away from 20 MIPS, as HUB access is much slower.

    Extend the program size such that we need PASM overlays or LMM to accommodate all the code and we can easily be down to 5 MIPS or less.

    But we are working in Spin. Its executable code is byte codes living in HUB (slow access, as I said above) and each byte code might take 10, 20, 30 PASM instructions to interpret. So what have we here? Byte code execution at 1 million per second or so? (Does anyone have a figure for that?)

    It might take 5 or 10 byte codes to implement each line of a Spin program. So what now, 100 thousand statements per second or so? (Just guessing.)

    Then we come to the 125 tasks we are running, so divide again by about 100. Each task is running at less than 1000 Spin statements per second.
    Yes, the loader has some overhead. In the future, maybe it can be overwritten to recycle the memory.

    What loader? We are running Spin codes from HUB here. There is nothing to recycle.
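
    Rolled into one place, those guesses look something like this in Spin (every number here is a rough estimate from the reasoning above, not a measurement, and the names are made up):

      CON
        SPIN_STMTS_PER_SEC = 100_000        ' ballpark for straight Spin at 80 MHz

      PUB PerTaskRate(tasksPerCog) : stmts
      '' Rough Spin statements per second each task sees under a round-robin scheduler,
      '' before subtracting the scheduler's own overhead, so the real figure is lower still.
        stmts := SPIN_STMTS_PER_SEC / tasksPerCog   ' 125 tasks -> about 800 statements per second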
  • Leon Posts: 7,620
    edited 2010-09-03 07:55
    mHz might be about right! :)
  • ericball Posts: 774
    edited 2010-09-03 08:45
    Humanoido wrote: »
    For illustrative purposes, the prop is overclocked with a 6.25 mHz crystal. 125 tiny spin programs load into a cog.

    A cog loader puts 125 programs in each cog and runs all 1,000 programs by multitasking.

    What can be said about the function of each program (and each cog)?
    A COG instruction takes a minimum of 4 clock cycles, so that's 0.25 MIPS per MHz. If using an external crystal with the x16 multiplier, that's 4 MIPS per crystal MHz. That's the absolute maximum per COG. Each Prop has 8 COGs, so it is capable of 32 MIPS per crystal MHz.

    But that assumes the COG is running a PASM application with no HUB access or anything else to slow it down. If the COG is running LMM, XMM, or an interpreter like Spin, it will be significantly slower.

    If the COG is running some kind of super-duper multitasking / multithreaded program then each thread will receive a slice of the overall performance. Unlike something like the i7, the Propeller doesn't have multiple functional units which can be allocated to multiple threads so that while thread 1 is executing a floating-point operation or waiting on memory or I/O, thread 2 can be executing integer operations. In fact, because thread/task management has to be done completely in software, the total performance will be lower. In other words, for maximum performance you add processors, not threads.
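
    A sketch of the "add processors" side of that in Spin (TaskA, TaskB and the stack sizes are invented; the point is that each cognew claims a real cog with its own 20 MIPS, and nothing gets time-sliced unless you write a scheduler yourself):

      VAR
        long  stackA[32], stackB[32]

      PUB Main
        cognew(TaskA, @stackA)              ' one hardware cog per task
        cognew(TaskB, @stackB)              ' a second cog, not a second time slice

      PRI TaskA | x
        repeat
          x++                               ' stand-in workload

      PRI TaskB | y
        repeat
          y++                               ' stand-in workload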
  • Beau Schwabe Posts: 6,568
    edited 2010-09-03 09:18
    Technically though ...

    ... if at 80 MHz you have 20 MIPS and all 8 cogs run the same program instructions, then in terms of functionality it would still be 20 MIPS with lots of redundancy, correct? :smilewinkgrin:

    On the other hand if that same program were equally divided and distributed across the 8 cogs, so that each cog ran a portion of the program instructions, THEN you could say you were running 160 MIPS.
  • Humanoido Posts: 5,770
    edited 2010-09-03 10:04
    Heater: thanks for the clear explanation. The 200 MIPS came from the idea of overclocking the chip with a 6.25 MHz crystal, giving 25 MIPS per cog.

    32 bytes could hold the workspace for a small demonstration program, i.e. an LED blinker program. It could illustrate timing and verify parameter passing, pins, counters, tasking, time slicing, and other details.

    The speed is slow, i.e. about 4x slower than a BASIC Stamp 2, but overclocking will boost it a little, and 1,000 IPS will still run programs. Those little blinker programs only have a few lines of code. If it's 10 lines, a pass through the program still takes only about 1/100 of a second, entirely usable for demonstration purposes, right?
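
    Something like this, in plain Spin (the pin number and the half-second delay are arbitrary):

      PUB Blink(pin)
        dira[pin]~~                         ' make the pin an output
        repeat
          !outa[pin]                        ' toggle the LED
          waitcnt(clkfreq / 2 + cnt)        ' wait half a second between toggles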

    Leon: millihertz is 10 to the minus 3 hertz. You'd have to spin your cog backwards to come up with that number. :p
  • Humanoido Posts: 5,770
    edited 2010-09-03 10:13
    Beau Schwabe wrote: »
    Technically though ...

    ... if at 80 MHz you have 20 MIPS and all 8 cogs run the same program instructions, then in terms of functionality it would still be 20 MIPS with lots of redundancy, correct? :smilewinkgrin:

    On the other hand if that same program were equally divided and distributed across the 8 cogs, so that each cog ran a portion of the program instructions, THEN you could say you were running 160 MIPS.
    Beau Schwabe, a good point. Cog space is really amazing! If I run multiple instances of the same task in one program, and put that one program into 8 cogs, then it should qualify for the 20 MIPS rule. So then, in that case, it would be 20 MIPS divided by the number of instances to get the speed of each instance. On the other hand, has anyone actually divided a program by eight and run it as a whole in eight cogs? I don't recall seeing any example of that.
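
    Not a worked example from the forum, but a minimal sketch of dividing one job across the cogs in Spin (the buffer sizes, the XOR-overlay workload and the done flags are all invented; each cog gets one slice of the data and the launcher waits until every slice reports finished):

      CON
        COGS   = 8
        PIXELS = 1_024                      ' made-up buffer size, chosen to divide evenly by COGS
        SLICE  = PIXELS / COGS

      VAR
        long  frame[PIXELS]                 ' stand-in data, not real video
        long  overlay[PIXELS]
        long  done[COGS]
        long  stack[COGS * 32]

      PUB Process | i
        longfill(@done, 0, COGS)
        repeat i from 0 to COGS - 1
          cognew(XorSlice(i), @stack[i * 32])   ' each cog works on its own eighth of the data
        repeat i from 0 to COGS - 1
          repeat until done[i]              ' wait for every slice to finish

      PRI XorSlice(id) | j
        repeat j from id * SLICE to (id + 1) * SLICE - 1
          frame[j] ^= overlay[j]
        done[id] := true                    ' the cog stops itself when this method returns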
  • Humanoido Posts: 5,770
    edited 2010-09-03 10:21
    Beau,

    Your signature shows:
    Beau Schwabe
    IC Layout Engineer
    Parallax Inc. * 599 Menlo Drive * Rocklin California 95765

    but the avatar shows "Oklahoma."
    Are you working remotely?

    Humanoido
  • RinksCustoms Posts: 531
    edited 2010-09-03 11:13
    Beau Schwabe wrote: »
    Technically though ...

    ... if at 80 MHz you have 20 MIPS and all 8 cogs run the same program instructions, then in terms of functionality it would still be 20 MIPS with lots of redundancy, correct? :smilewinkgrin:

    On the other hand if that same program were equally divided and distributed across the 8 cogs, so that each cog ran a portion of the program instructions, THEN you could say you were running 160 MIPS.

    Exactly the whole point behind parallel processing! Again, it's going to take imagination and some clever thought processes to bring parallel processing past the large data-crunching days. As far as I can see, in terms of definition, parallel processing isn't even implemented on the Prop in the majority of applications. E.g., say your main program does video editing, like an XOR operation between individual video frames and a static picture used as an overlay. Using parallel processing, by definition, would mean loading say 4 of the cogs with the same XOR PASM code but passing different sections of the same video frame and static picture to the different cogs. In this parallel application, the video processing app finishes the same frame-and-overlay operation 4x faster than a single cog crunching the data.
    I think the main thing the Prop is best at is multitasking in parallel; gathering multiple inputs while producing multiple outputs, all at the same time, is ideally suited to the P8X32A. However, the webinars on the preliminary features of the Prop II reveal many new possibilities for true parallel processing, especially with divided Vdd pin groups, 1:1 instructions, 2x clkfreq and 64 I/Os. Really looking forward to the Prop II.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2010-09-03 12:33
    Rinks,

    You raise an interesting point and one that I've been wrestling with for some time now. Most programs written for the Prop do not, in fact, use parallel processing at all. Rather, they implement one thread of execution that's serviced by one or more software peripherals, leaving several cogs idle. In a true parallel processing regime, there would not be any idle cogs. They would be assigned by the compiler to various portions of the main task. Unfortunately, Spin, being strictly procedural, is not the most convenient platform with which to implement such a scheme. A better programming paradigm might be dataflow, a message-passing system in which parallel resources are allocated automatically and dynamically. The downside of this, of course, is a loss of determinism, which may trump any advantages of parallel processing for the types of apps the Propeller is most adept at. This does beg the question, though: would the Prop be more adept at truly parallel apps with different dev tools?

    -Phil
  • Leon Posts: 7,620
    edited 2010-09-03 12:50
    There are true parallel processing languages such as occam and XC, which are based on Hoare's CSP. They could be implemented on the Propeller, but the architecture isn't really suitable for them.
  • Humanoido Posts: 5,770
    edited 2010-09-03 20:08
    Heater. wrote: »
    Assuming all those "threads" are the same code, we only have 32K HUB RAM so each of your 1000 instances only has 32 bytes of work space to itself (max). Not much to work on.
    Is there a Forum option to make a memory expansion board to increase the 32K HUB RAM?

    I found this link.
    http://machineinteltech.com/Memory_Expansion.html

    Humanoido
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2010-09-03 20:24
    Humanoido,

    The source you cited represents a formerly open sore around here that's better left unscratched. Moreover, their website seems devoid of any recent activity. My suggestion would be to see if Bill Henning has something to offer. But be forewarned that any external memory will not be addressable with the efficiency afforded by the Prop's own internal hub RAM. Don't forget that hub RAM addresses are 16 bits. That's just enough for the Prop's internal 32K of RAM + 32K of ROM.

    -Phil