Cog Speed
Humanoido
We know that theoretically the Prop runs at 80 MHz and does 160 MIPS. That speed is often divided up and quoted as 20 MIPS for each of the eight cogs.
However, based on theoretical speed, if a program is run in one cog, does it still run at the full 160 MIPS?
Exactly what are we saying about the same small program running in eight cogs?
When does 20 MIPs apply?
Thank you in advance.
Humanoido
Comments
The instructions in one COG execute at one quarter of the clock speed, except for the few instructions that require more than 4 clocks.
So one COG at 80 MHz is 20 MIPS.
If you can run that same small program in 8 COGs, then you have 160 MIPS.
This is all academic for larger programs and larger data sets, where a lot of HUB access is involved and/or LMM/interpreted code is used. Then we are down to 5 MIPS per COG (LMM) or a tenth of that (interpreted).
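As a quick sanity check (mine, not from the original posts), the figures quoted above work out like this; the LMM and interpreter slowdown ratios are the poster's ballpark estimates, not datasheet numbers:

```python
# Rough throughput estimates for the Propeller (P8X32A), using the
# figures from this thread: 80 MHz clock, 4 clocks per PASM instruction,
# 8 cogs. The ~4x LMM and further ~10x interpreter slowdowns are the
# poster's rough ratios, assumed here for illustration only.
CLOCK_HZ = 80_000_000
CLOCKS_PER_INSTRUCTION = 4
COGS = 8

mips_per_cog = CLOCK_HZ / CLOCKS_PER_INSTRUCTION / 1_000_000
total_mips = mips_per_cog * COGS
lmm_mips_per_cog = mips_per_cog / 4        # LMM: roughly a quarter speed
interpreted_mips = lmm_mips_per_cog / 10   # interpreted: ~a tenth of that

print(mips_per_cog)       # 20.0
print(total_mips)         # 160.0
print(lmm_mips_per_cog)   # 5.0
```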
That is the case because each COG is a single, non-pipelined core.
The Propeller has 8 independent cores, and each performs up to 1 instruction every 4 cycles, which at 80 MHz equals 20 MIPS. This can easily be "improved" to 26 MIPS with a 6.5 MHz crystal.
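To spell out where those crystal figures come from (my addition, assuming the Prop's usual PLL16X clock mode, where the crystal frequency is multiplied by 16):

```python
# Crystal frequency -> MIPS per cog, assuming PLL16X (x16 multiplier)
# and 4 clocks per instruction, as discussed in this thread.
PLL_MULTIPLIER = 16
CLOCKS_PER_INSTRUCTION = 4

def mips_per_cog(crystal_mhz):
    return crystal_mhz * PLL_MULTIPLIER / CLOCKS_PER_INSTRUCTION

print(mips_per_cog(5.0))   # 20.0  (standard 5 MHz crystal, 80 MHz clock)
print(mips_per_cog(6.25))  # 25.0  (100 MHz overclock)
print(mips_per_cog(6.5))   # 26.0  (the 104 MHz overclock mentioned above)
```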
For illustrative purposes, the Prop is overclocked with a 6.25 MHz crystal. 125 tiny Spin programs are loaded into a cog.
A cog loader puts 125 programs in each cog and runs all 1,000 programs by multitasking.
What can be said about the function of each program (and each cog)?
Thank you very much for your replies.
Humanoido
Let's be clear here:
If this is 125 fragments of Spin code being run on a COG, the only "program loaded into the COG" is the Spin interpreter.
Again, if we are talking Spin, the cog loader, by which you mean COGINIT, is only loading one program, the Spin interpreter.
Again, the cog loader is not doing this; presumably you mean there is some "scheduler" Spin method that calls 125 other Spin methods, as we have discussed elsewhere.
How can we know? You have conceived and/or written them and we have not seen them.
Assuming all those "threads" are the same code, we only have 32K of HUB RAM, so each of your 1000 instances has at most 32 bytes of workspace to itself. Not much to work on.
You have 8 of those Spin "schedulers" running, one per COG. Each is running 125 little task methods, so each task runs at 1/125th of the speed it would if it were the main Spin program, minus the overheads of the scheduler itself.
This is all very unrealistic, isn't it?
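A back-of-envelope check of those two claims (my arithmetic, using the thread's assumed figures of 32K hub RAM, 1000 task instances, and 125 tasks per cog):

```python
# Workspace and relative speed per task, using this thread's assumptions:
# 32 KB of hub RAM shared by 1000 identical Spin "task" instances,
# 125 tasks time-sliced per cog. Overheads are ignored entirely.
HUB_RAM_BYTES = 32 * 1024
TASK_INSTANCES = 1000
TASKS_PER_COG = 125

workspace_per_task = HUB_RAM_BYTES // TASK_INSTANCES  # bytes each, at best
relative_speed = 1 / TASKS_PER_COG                    # fraction of full speed

print(workspace_per_task)  # 32
print(relative_speed)      # 0.008, i.e. 1/125th
```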
Yes, the loader has some overhead. In the future, maybe it can be overwritten to recycle the memory.
"mHz" = millihertz
"MHz" = megahertz <-- this is the one you want here.
/pet peeve
In this discussion we have 8 COGs running 8 Spin programs, each running 125 instances of something unspecified. So those 8 COGs are all running a copy of the Spin interpreter, which fills the entire COG. That's where the 2K COG RAMs have gone. Programs written in Spin do not have any access to COG space (well, except for the special function registers via Spin built-in functions).
As I said, each task only has 32 bytes of RAM to play in, and each one is running at less than 1/125th of the speed of a normal Spin program. Is there any realistic use for such a setup?
What?!! Where did that 200MIPS come from?
We already determined that small PASM programs running with all data in COG can achieve 20 MIPS on an 80 MHz Prop.
Extend that so they make heavy use of HUB RAM data and we start to drop away from 20 MIPS, as HUB access is much slower.
Extend the program size such that we need PASM overlays or LMM to accommodate all the code, and we can easily be down to 5 MIPS or less.
But we are working in Spin. Its executable code is byte codes living in HUB (slow access, as I said above), and each byte code might take 10, 20, or 30 PASM instructions to interpret. So what have we here? Byte code execution at 1 million per second or so? (Does anyone have a figure for that?)
It might take 5 or 10 byte codes to implement each line of a Spin program. So what now, 100 thousand statements per second or so? (Just guessing.)
Then we come to the 125 tasks we are running, so divide again by roughly that much. Each task is running at less than 1000 Spin statements per second.
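The estimate above, step by step (my restatement; every ratio here is the poster's admitted guess, not a measurement):

```python
# Chain of rough estimates from the post above: PASM speed, divided by
# an assumed ~20 PASM instructions per Spin byte code, ~10 byte codes
# per Spin statement, and 125 time-sliced tasks per cog.
pasm_ips = 20_000_000                    # 80 MHz / 4 clocks per instruction
bytecodes_per_s = pasm_ips / 20          # guessed interpreter overhead
statements_per_s = bytecodes_per_s / 10  # guessed byte codes per statement
per_task = statements_per_s / 125        # 125 tasks sharing one cog

print(int(bytecodes_per_s))   # 1000000
print(int(statements_per_s))  # 100000
print(int(per_task))          # 800 -- under 1000 statements/s, as claimed
```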
What loader? We are running Spin code from HUB here. There is nothing to recycle.
But that assumes the COG is running a PASM application with no HUB access or anything else to slow it down. If the COG is running LMM, XMM, or an interpreter like SPIN it will be significantly slower.
If the COG is running some kind of super-duper multitasking / multithreaded program then each thread will receive a slice of the overall performance. Unlike something like the i7, the Propeller doesn't have multiple functional units which can be allocated to multiple threads so while thread 1 is executing a floating point operation or waiting on memory or I/O thread 2 can be executing integer operations. In fact, because thread/task management has to be done completely in software, the total performance will be lower. In other words, for maximum performance you add processors, not threads.
... if at 80 MHz you have 20 MIPS and all 8 cogs run the same program instructions, then in terms of functionality it would still be 20 MIPS with lots of redundancy, correct? :smilewinkgrin:
On the other hand if that same program were equally divided and distributed across the 8 cogs, so that each cog ran a portion of the program instructions, THEN you could say you were running 160 MIPS.
32 bytes could run a small program for demonstration, e.g. an LED blinker program. It could illustrate timing and verify parameter passing, pins, counters, tasking, time slicing, and other details.
The speed is slow, i.e. 4x slower than a BASIC Stamp 2, but overclocking will boost it a little, and 1000 IPS will still run programs. Those little blinker programs only have a few lines of code. If ten, then the program still completes a pass in 1/100 of a second, entirely usable for demonstration purposes, right?
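Checking that last figure (my arithmetic, taking the thread's ~1000 interpreted statements per second per task at face value):

```python
# Time per pass of a small blinker loop, assuming the ~1000 Spin
# statements per second per task estimated earlier in this thread.
statements_per_second = 1000
loop_statements = 10

seconds_per_pass = loop_statements / statements_per_second
print(seconds_per_pass)  # 0.01, i.e. 1/100 of a second per pass
```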
Leon: millihertz is 10 to the minus 3 hertz. You'd have to spin your cog backwards to come up with that number.
Your signature shows:
Beau Schwabe
IC Layout Engineer
Parallax Inc. * 599 Menlo Drive * Rocklin California 95765
but the avatar shows "Oklahoma."
Are you working remotely?
Humanoido
Exactly the whole point behind parallel processing! Again, it's going to take imagination and some clever thought processes to bring parallel processing past the large data-crunching days. As far as I can see, in terms of definition, parallel processing doesn't even seem to be implemented on the Prop in the majority of applications. E.g., say your main program does video editing, like an XOR operation between individual video frames and a static picture used as an overlay. Using parallel processing, by definition, would mean loading say 4 of the cogs with the same XOR PASM code but passing different sections of the same video frame and static picture to the different cogs. In this parallel application, the video processing app finishes the XOR of a frame against the overlay 4x faster than a single cog crunching the data.
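A host-side sketch of that scheme (my illustration, plain Python standing in for four cogs that all run the same XOR PASM code on different sections of the frame):

```python
# Split one video frame into four slices and XOR each slice against the
# matching slice of a static overlay, as four cogs would do in parallel.
# The frame and overlay here are tiny stand-in byte lists.
def xor_slice(frame, overlay, start, end):
    """The work one 'cog' does: XOR its assigned section only."""
    return [f ^ o for f, o in zip(frame[start:end], overlay[start:end])]

frame = list(range(16))   # stand-in for one frame's pixel bytes
overlay = [0xFF] * 16     # stand-in for the static picture
cogs = 4
step = len(frame) // cogs

# Same code for every "cog"; only the data range differs.
result = []
for i in range(cogs):
    result += xor_slice(frame, overlay, i * step, (i + 1) * step)

print(result == [p ^ 0xFF for p in frame])  # True: same answer, 4-way split
```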
I think the main thing the Prop is best at is multitasking in parallel: gathering multiple inputs while driving multiple outputs all at the same time is ideally suited to the P8X32A. However, the webinars on the preliminary features of the Prop II unlock many new possibilities for true parallel processing, especially with divided Vdd pin groups, 1:1 instructions, 2x clkfreq, and 64 I/Os. Really looking forward to the Prop II.
You raise an interesting point and one that I've been wrestling with for some time now. Most programs written for the Prop do not, in fact, use parallel processing at all. Rather, they implement one thread of execution that's serviced by one or more software peripherals, leaving several cogs idle. In a true parallel processing regime, there would not be any idle cogs. They would be assigned by the compiler to various portions of the main task. Unfortunately, Spin, being strictly procedural, is not the most convenient platform with which to implement such a scheme. A better programming paradigm might be dataflow, a message-passing system in which parallel resources are allocated automatically and dynamically. The downside of this, of course, is a loss of determinism, which may trump any advantages of parallel processing for the types of apps the Propeller is most adept at. This does beg the question, though: would the Prop be more adept at truly parallel apps with different dev tools?
-Phil
I found this link.
http://machineinteltech.com/Memory_Expansion.html
Humanoido
The source you cited represents a formerly open sore around here that's better left unscratched. Moreover, their website seems devoid of any recent activity. My suggestion would be see if Bill Henning has something to offer. But be forewarned that any external memory will not be addressable with the efficiency afforded by the Prop's own internal hub RAM. Don't forget that hub RAM addresses are 16 bits. That's just enough for the Prop's internal 32K of RAM + 32K of ROM.
-Phil
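Phil's "just enough" remark can be checked directly (my arithmetic, using the 16-bit hub address width and memory sizes stated above):

```python
# Why 16-bit hub addresses are "just enough": 2**16 bytes exactly covers
# the Prop's 32 KB of hub RAM plus 32 KB of ROM, leaving no address
# space for external memory to be mapped in directly.
ADDRESS_BITS = 16
HUB_RAM_BYTES = 32 * 1024
HUB_ROM_BYTES = 32 * 1024

print(2 ** ADDRESS_BITS)                   # 65536
print(HUB_RAM_BYTES + HUB_ROM_BYTES)       # 65536 -- an exact fit
```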