How Fast Can You Make Your Propeller?
Humanoido
We know that assembly runs faster than Spin, and that some assembly operations are faster than others. Using a single Propeller chip, how fast can you make it run? Use a simple program, a single statement, hardware wiring, or any other technique, and confirm the result with a software or hardware measurement. In other words, how close can you get in the real world to the theoretical specs quoted by Parallax (160 MIPS, 80 MHz)?
Humanoido
Humanoido
Comments
If I write something like:
all of those instructions are being executed at a rate of 4 Propeller clock cycles per instruction, or 20 million per second, for a normal 80 MHz setup.
A result which you can verify with a counter on pin 0.
It is not difficult at all to achieve 160 MIPS with the right application.
I have a real-world application running on eight cogs that executes billions of full-speed instructions between each RdLong, WrLong, or Lock access. Since it uses one of Bill's super deluxe crystals, the chip is actually doing 200 MIPS.
Humanoido
You make a good point. Do you have a Forum post about your project? I usually swap between the 5 and 6.25 MHz crystals to go from an 80 to a 100 MHz clock. I would stick with the 6.25, but many software examples require the 5. So it's good thinking that Parallax put a tiny socket on their Demo and Proto boards. Outside of that, I use solderless breadboards. The breadboards work fine even when overclocking. I began working with some printed circuit boards that match large breadboards to more easily transfer designs from bread.
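The clock arithmetic behind the 160 and 200 MIPS figures in this exchange can be sketched as follows. This is only a back-of-the-envelope model: it assumes the 16x PLL setting and that every cog executes nothing but 4-clock instructions; the function names are made up for illustration.

```python
# Rough model of Propeller peak instruction rates (a sketch, not a measurement).
PLL_MULTIPLIER = 16          # assumption: the 16x PLL setting is in use
CLOCKS_PER_INSTRUCTION = 4   # ordinary cog instructions, as stated in the thread
COGS = 8

def system_clock_hz(crystal_hz):
    """System clock produced from the crystal by the PLL."""
    return crystal_hz * PLL_MULTIPLIER

def cog_mips(crystal_hz):
    """Peak millions of instructions per second for one cog."""
    return system_clock_hz(crystal_hz) / CLOCKS_PER_INSTRUCTION / 1e6

def chip_mips(crystal_hz):
    """Peak MIPS with all eight cogs running 4-clock instructions."""
    return cog_mips(crystal_hz) * COGS

if __name__ == "__main__":
    for crystal in (5_000_000, 6_250_000):
        print(crystal, system_clock_hz(crystal), cog_mips(crystal), chip_mips(crystal))
```

With the 5 MHz crystal this gives 80 MHz, 20 MIPS per cog and 160 MIPS for the chip; with the 6.25 MHz crystal it gives 100 MHz, 25 MIPS per cog and 200 MIPS.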
Humanoido
Here all we want to do is add numbers together in HUB.
Here the data we want to work on is in HUB, as is mostly true in the real world. The important operation here is just ADD. What performance are we getting now?
I have trouble working this out but:
When the loop has been around once, that first rdlong takes 18 clocks, that is, one HUB cycle (22) minus the time used in the JMP (4).
The second rdlong takes a full HUB cycle from that last HUB access, 22 clocks.
The wrlong takes one HUB cycle (22) minus the time spent on add and xor (8), so 14 clocks. By the way this demonstrates why it's important to interleave 2 or 3 "normal" instructions between HUB operations, they make good use of the HUB access waiting time.
What happens when we first hit the loop?
No idea, that rdlong can take any random time from 7 to 22 clocks depending on the HUB phase.
Total is 66 clocks per loop, an average of 11 clocks per instruction or about 7.3 MIPS. Less than half the theoretical peak. Note that we are only performing the ADD operation we want at 1.2 million per second, i.e. 16 times slower than we can do ADDs on COG data.
This is why I constantly drone on about the difference between "theoretical" and "actual" performance. And hence the unsuitability of calling anything built with Propellers a "super computer".
N.B. Those more experienced might want to correct my timing analysis.
N.B. Yes, I know this is a terribly contrived example. It's intended to show the other end of the spectrum of performance on real world data.
HUB operations occur on a fixed schedule.
A COG's HUB operation execution is like taking a train to Berlin which leaves the station every half hour. If you make the train on time, you get to ride it then. If you miss the train by even a few seconds, you have to wait a half hour for the next train. Now, there are other trains that leave the station during that half hour, but they are not for you because they are not going to Berlin. Other passengers may ride the other trains to Brussels, etc..., but they are subject to the same half hour schedule restriction. Only one train can be in the station at any given time so everyone has to wait their turn.
I believe the above timing is *practically* correct. Some hub instructions may actually end a clock cycle earlier, but that tidbit is *practically* irrelevant in this case.
For years and years I was compelled to count every clock cycle of every loop and branch. I didn't use interrupts because there was no time for such an extravagance. Besides, the lack of determinism would have been intolerable.
So when Humanoido makes a big deal of achieving "theoretical" performance with a simple pin toggling routine, I'm puzzled at his amazement. But as Heather has illustrated, just because a processor is achieving theoretical performance doesn't mean it is doing anything useful. (Sorry, Heater, it was just too funny!)
And that leads to my last point: What is amazing to me are all the applications that succeed in converting a multi-core embedded controller into a general purpose CPU.
What an application does is ultimately much more important than the theoretical efficiency achieved, unless that application needs every clock cycle.
Horses for courses? I know nothing about horses
For example, my AES-128 routines require the following for basic operations:
Key Expansion ~4.4K cycles, Cipher ~16.1K cycles/block, and Inverse Cipher ~18K cycles/block. I have no idea how these compare to other processors (obviously they are much slower than a dedicated hardware implementation), but I suspect fairly well as AES doesn't require multiplication, division or floating point (i.e. stuff the Propeller doesn't have).
And although the Propeller has no floating point hardware, the same goes double for MFLOPS. Historically algorithms were compared based on the number of floating point operations required, ignoring any required integer or data movement operations. But since the number of clock cycles required to perform floating point operations has decreased (often to single cycles) the time required to perform the integer operations and data movement can no longer be ignored.
Therefore rather than think in terms of MIPS or MFLOPS or even MHz, the real question is how much time is required to get the desired work done. And often that question is dependent not only on the processor, but the algorithm and the language / optimization.
Many years ago I did a software implementation of Reed-Solomon ECC coding for a video data stream. (The actual implementation was in hardware using dedicated custom chips. My code was to create test data.) The original C code required several minutes per frame and was judged too slow. I did some profiling and optimized several hotspots dropping the time per frame to around a minute. This was still too slow so I hand coded the innermost routine in 8086 - dropping the time to less than 30 seconds per frame.
Nah, I think I understand the HUB well enough. However as you see my translation into numbers is fluent gibberish.
This morning my caffeine deprived brain was thrown off course by peeking in the Prop manual and finding all that 7 to 22 clocks business.
So without looking at your numbers I'll have another go:
A sequence of continuous rdxxxx/wrxxxx will execute at the rate of one every HUB revolution, 16 clocks.
The code here does not have enough "normal" instructions between HUB accesses to cause them to miss a HUB cycle.
Therefore execution timing is determined solely by the number of HUB accesses in the loop. So 3 * 16 is 48 clocks for the loop.
Oh look, you have the same result!
OK, now we can rewrite my conclusion:
Total is 48 clocks per loop, an average of 8 clocks per instruction or 10 MIPS. Half the theoretical peak. Note that we are only performing the ADD operation we want at 1.7 million per second, i.e. 12 times slower than we can do ADDs on COG data.
Hmm... doesn't change much does it?
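The revised 48-clock figure can be checked with a small timing model. This is a sketch under stated assumptions, not a cycle-accurate simulator: the loop is taken to be rdlong, rdlong, add, xor, wrlong, jmp (inferred from the description, since the posted code is not shown here), a cog's hub window is assumed to come round every 16 clocks, and a perfectly aligned hub instruction is assumed to take 8 clocks. The 8-clock minimum is an assumption; the manual figure quoted in the thread is 7 to 22.

```python
# Simplified hub-timing model for one cog (illustrative assumptions throughout).
HUB_PERIOD = 16   # clocks between this cog's hub access windows
MIN_HUB = 8       # assumed clocks for a hub instruction that is perfectly aligned
NORMAL = 4        # clocks for an ordinary cog instruction
CLOCK_HZ = 80_000_000

# Loop inferred from the discussion: two reads, add, xor, write, jump.
LOOP = ["hub", "hub", "normal", "normal", "hub", "normal"]

def loop_clocks(sequence, passes=4, window_phase=0):
    """Clocks per pass once the loop has locked onto the hub schedule."""
    t = 0
    pass_ends = []
    for _ in range(passes):
        for kind in sequence:
            if kind == "hub":
                earliest = t + MIN_HUB
                # Wait until the next time congruent to window_phase mod 16.
                wait = (window_phase - earliest) % HUB_PERIOD
                t = earliest + wait
            else:
                t += NORMAL
        pass_ends.append(t)
    return pass_ends[-1] - pass_ends[-2]

def loop_mips(sequence):
    """Average instruction rate of the loop in MIPS."""
    return len(sequence) * CLOCK_HZ / loop_clocks(sequence) / 1e6

if __name__ == "__main__":
    clocks = loop_clocks(LOOP)
    print(clocks, loop_mips(LOOP), CLOCK_HZ / clocks)
```

Under these assumptions the steady-state loop costs 48 clocks regardless of the starting hub phase, which is 10 MIPS and roughly 1.67 million ADDs per second. Only the very first rdlong depends on the phase.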
Have you seen any program on any other chip running in the real world that exactly matches the theoretical speed quote on their manufacturing spec sheet? I have not.
Though toggling a pin is not useful to you, it does not mean it is not useful to others. There are many things we can do with pin toggling. An entire book could be written about it. Well, me too, when I first started LED blinking seemed relatively benign. Now I know it's the most useful, simple, affordable and powerful multi-purposed tool in the microprocessor world, for testing, experiments, debugging, analysis, display of numbers, determining timing, program function, menu, panel indicators, PWM output, signal tracer, voltage scale, ammeter, sensor, confirmation of inputs, power supply reckoning, serial device, interface tester, bit banger, code dev, and a temp substitute for a $2,000.00 oscilloscope to quick measure some timing, and the list goes on and on.
The point, proof of the pudding, was never in the pin toggling anyway. There is a higher reason which is explained precisely in the first post in this thread. (below)
Almost. The Intel 860 was spec'ed at 66 MFLOPS when it was launched. It could work on the Fast Fourier Transform at that peak rate, all the integer loop housekeeping being done in parallel with the float ops.
The joke was that Intel did not know how to program their own chip to do that. Some other company had worked it out and Intel was not allowed to show us the code:)
Now we have seen little code snippets for the Prop that demonstrate both ends of the performance range you can expect.
They are both true!
How they apply to the application one has in mind is something one has to work out for oneself.
Quite so, makes all kinds of things possible when working on a shoestring budget with a little imagination.
The closer you edge toward this theoretical limit for a given project, the more calories have to be spent optimizing the code line by line, and all the time the code becomes increasingly specialized to the job at hand. In the end it's a balance: ease of coding and versatility vs. speed. Where you decide to draw the line depends on the requirements of your specific application and how much time you want to dedicate.
1) Some real-world tasks, by their very nature, keep the Prop ticking along at 160 MIPS.
2) Other real-world tasks would be very difficult (or impossible) to code in such a way that every cog is kept busy every moment.
Maxing-out a processor depends far more on the application than it does on the processor or the programmer. With the right application, it's utterly simple to max-out a Propeller. And it's utterly simple to prove that the Propeller is maxed out.
I knew exactly how fast a particular algorithm would run on the Propeller before I ever ran it. I predicted the results would appear in 28 minutes and 30 seconds. They appeared in 28 minutes and 28 seconds. The difference (of two seconds) is explained by the fact that the crystal in my watch and the crystal I wired to the Prop are neither perfect nor matched. If I'd somehow used the same crystal to clock both devices, the results would have matched the prediction exactly.
Microcontrollers like the Prop are deterministic. The more assembly code you write and test, Humanoido, the better you will understand how they work, and how fundamentally simple they are. It is in connecting to the external world, and specifically in WAITING for the external world, that slowdowns and unpredictabilities arise.
"MIPS", without further qualification, is an entirely meaningless concept. Total throughput relies on much more than instruction execution rate. In many applications, a Prop cog running at 20 MIPs will outperform an SX at 75 MIPs; in other apps, the SX will be faster. A lot depends on whether 32-bit computations are necessary or not. Another factor is the number and kind of addressing modes available. The Prop is a two-address machine. With full register-to-register operations, it can accomplish more in one instruction (all else considered) than a one-address machine, like the SX, that relies on an accumulator.
-Phil
Phil- can you elaborate on how to toggle a pin with no programming instructions?
And how to go beyond 20 MHz barrier using phsx? Are you using mirrors to trigger the event horizon?
Humanoido
The first of Phil's proposals is that you can set up a counter to generate the same signal as my little PASM code loop. It will sit there doing that whilst your code then runs something else. Therefore that counter is worth 20 MIPS of processing and you now have an effective 40 MIPS COG:)
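As a rough illustration of why a counter can toggle a pin with no instructions in the loop, here is the arithmetic for a counter in NCO mode. In that mode the counter adds FRQx to PHSx on every system clock and drives the pin from bit 31 of PHSx, so the output frequency is FRQx * clock / 2^32. The helper names below are hypothetical, and this sketch only computes register values; it does not configure a counter.

```python
# NCO-mode frequency arithmetic for a Propeller counter (illustrative helpers).
CLOCK_HZ = 80_000_000   # assumed system clock
PHASE_BITS = 32         # PHSx is a 32-bit accumulator; bit 31 drives the pin

def nco_frq(target_hz, clk_hz=CLOCK_HZ):
    """FRQx value that gives approximately target_hz on the pin."""
    return round(target_hz * 2**PHASE_BITS / clk_hz)

def nco_output_hz(frq, clk_hz=CLOCK_HZ):
    """Pin frequency produced by a given FRQx value."""
    return frq * clk_hz / 2**PHASE_BITS

if __name__ == "__main__":
    print(hex(nco_frq(10_000_000)))      # FRQx for a 10 MHz square wave
    print(nco_output_hz(0x8000_0000))    # highest square wave: half the clock
```

Setting FRQx to $8000_0000 flips bit 31 on every clock, which is how a counter can produce a 40 MHz square wave from an 80 MHz clock while the cog's own code does something else.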
That's a bit harder, phsx is basically a counter that can be incremented by the timer hardware. If you can use the value of phsx as a pointer into COG memory, you now have a pointer that is incremented by the counter and not by your code. Effectively the timers are performing instructions for you, in parallel, and so you have exceeded 20MIPS.
Sorry I don't have an example of this.
It's just as useful as a pointer into HUB memory. Some might say more so
There is a condition in optics called constructive interference that can add to the light signal through waveform phasing. I'm thinking if there's an analogy with Propeller counters, and if those counter modes can be configured and run in tandem with the cog processing, we could build up a sub processor or coprocessor enhancement. As pointed out, it could make the Propeller chip exceed original specs in some ways.
It doesn't always take a big program to do something useful. Here we see something useful with no instructions. They say that tiny bugs inherit the world in part because of their sheer numbers, tiny size, and lack of a more complex biological structure. Perhaps a lot of tiny modes happening together can swarm into a greater purpose.
Humanoido
Little bugs have sensors, logic, memory, actuation, feedback, etc.
While you consider the role of some ersatz programming gimmick, you have how many fully-functional 32-bit CPUs sitting idle awaiting something useful to do?
What is it that you don't understand?
Counters are a dime a dozen - almost as cheap as words.
What does the cost of a counter have to do with anything? BTW, what are you paying for yours?
Little bugs have sensors, logic, memory, actuation, feedback, etc.
Indeed they do. Your point?
While you consider the role of some ersatz programming gimmick,
There is no consideration of any duplicate gimmick. We are developing techniques and algorithms for a greater cause. You would know if you read the introductory post.
you have how many fully-functional 32-bit CPUs sitting idle awaiting something useful to do?
I have no idle CPUs. Are you on something?
The topic of this thread is "How fast Can you make your Propeller?"
Well, you can write PASM code in COGs and attain whatever speed it is. But what if the counters can somehow be used in conjunction with your PASM to enhance the speed of whatever you are doing?
As has been suggested they can be used as auto incrementing pointers to speed up array traversal.
I think a classic example of this is Phil Pilgrim's radio receiver built with just a Prop and a couple of resistors/capacitors.
For sure you could implement the algorithm he uses to do signal detection in PASM alone but it would be too slow to be of use. Drag the counters in to help and you have effectively multiplied the speed of the thing by a factor of 10 or whatever.
Hanno
"Are you on something?" I'm usually on an R1150R, but not at the moment.
Actually there are projects completed and various methods that are not in completed projects. At a glance, the latter could easily be missed. There are several threads exploring those capabilities and creating new cutting edge technology to make the Prop more like "Extended Prop." Stay tuned for more..
Humanoido
The 'obsolete' SX chips can still run at 100 MIPS on one processor (that's a 100 MHz clock and one instruction per clock, though admittedly that is 8 bits and not 32 bits).
Compared to the Propeller at 80 MHz, you still get 80 MIPS per processor out of the SX versus 20 MIPS for the Propeller. Seems slower. I believe that you can easily get a PIC to clock at 20 MHz and produce 20 MIPS.
About the only feature that may be faster on the Propeller is those A and B Counters on the cogs. I believe they clock at 80 MHz.
It would seem that some BasicStamps are actually faster than the Propeller on a processor to processor basis. I am wondering if and when the Propeller will acquire one instruction per clock (The 8051s take 12 and multiples of 12!)
Well that's no small thing. If you want to do 32-bit maths on an SX you've eaten up all those extra MIPS. Same for the PIC. You probably don't even win comparing 16-bit maths on the SX vs 32-bit on the Prop.
Comparing 8-bit MIPS to 32-bit MIPS is just not on.
Now quite why the Prop only runs at 20MHz in this day and age is another question. For sure big silicon is slower than small silicon and the Prop must be a lot bigger than an SX.
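To make the 8-bit versus 32-bit point concrete, here is a sketch of a 32-bit add performed one byte at a time, the way an 8-bit ALU has to do it. The per-byte instruction count is an assumption for illustration only (load, add with carry, store on an accumulator machine); it is not taken from the SX or PIC instruction sets, and real code may need more or fewer instructions.

```python
# 32-bit addition built from 8-bit adds (illustrative, not SX or PIC code).
INSTR_PER_BYTE = 3   # assumption: load, add-with-carry, store per byte

def add32_via_bytes(a, b):
    """Add two 32-bit values bytewise; return (result, number of byte adds)."""
    result = 0
    carry = 0
    byte_adds = 0
    for i in range(4):
        x = (a >> (8 * i)) & 0xFF
        y = (b >> (8 * i)) & 0xFF
        s = x + y + carry
        carry = s >> 8                    # carry into the next byte
        result |= (s & 0xFF) << (8 * i)
        byte_adds += 1
    return result, byte_adds              # result wraps at 32 bits

def adds32_per_second(mips, instr_per_add32):
    """32-bit additions per second at a given instruction rate."""
    return mips * 1e6 / instr_per_add32

if __name__ == "__main__":
    _, byte_adds = add32_via_bytes(0x12345678, 0x11111111)
    per_add = byte_adds * INSTR_PER_BYTE
    print(adds32_per_second(75, per_add))   # 8-bit machine at 75 MIPS
    print(adds32_per_second(20, 1))         # one cog: a single 32-bit ADD
```

Under that assumed cost of 12 instructions per 32-bit add, a 75 MIPS 8-bit part manages about 6.25 million 32-bit adds per second against 20 million for a single cog working on cog data, which is the sense in which raw MIPS figures do not compare directly.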