How Fast Can you Make Your Propeller? — Parallax Forums

How Fast Can you Make Your Propeller?

Humanoido Posts: 5,770
edited 2010-09-21 08:35 in Propeller 1
We know that assembly can run faster than Spin and some assembly operations are faster than others. Using a single Propeller chip, how fast can you make it run, using a simple program, a single statement, hardware wiring, a technique, confirmed either with a software or hardware measurement, i.e. how close can you approach the theoretical specs quoted by Parallax (160 MIPs, 80 MHz) in the real world?

Humanoido

Comments

  • Heater. Posts: 21,230
    edited 2010-09-13 07:58
    Most assembly instructions take 4 clocks. The waits, hub access and djnz are the exceptions.

    If I write something like:
            mov  dira, #1      ' make pin 0 an output
    loop    xor  outa, #1      ' toggle pin 0
            jmp  #loop         ' repeat forever
    
    all of those instructions are being executed at a rate of 4 Propeller clock cycles per instruction or 20 million per second for a normal 80MHz set up.

    A result which you can verify with a counter on pin 0.
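As a rough cross-check of the figures in this post, here is a minimal Python sketch of the arithmetic. It assumes, as the post states, that every instruction in the loop costs 4 clocks; the function name `toggle_loop_stats` and its parameters are made up for illustration. It also works out the square-wave frequency a counter or scope should see on pin 0.

```python
# Back-of-envelope model of the two-instruction toggle loop (xor + jmp).
# Assumption: each instruction costs 4 clocks, as stated in the post.

def toggle_loop_stats(clock_hz=80_000_000, clocks_per_instr=4, instrs_in_loop=2):
    """Return (instructions per second, pin frequency in Hz) for the loop."""
    instr_per_sec = clock_hz // clocks_per_instr
    clocks_per_toggle = clocks_per_instr * instrs_in_loop  # one xor per pass
    # A full square-wave period needs two toggles (high, then low).
    pin_hz = clock_hz / (2 * clocks_per_toggle)
    return instr_per_sec, pin_hz

mips, pin_hz = toggle_loop_stats()
print(f"{mips / 1e6:.0f} MIPS, pin 0 square wave at {pin_hz / 1e6:.0f} MHz")
```

So at 80 MHz the loop runs at 20 MIPS and a frequency counter on pin 0 should read about 5 MHz, since the pin changes state once every 8 clocks.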
  • K2 Posts: 693
    edited 2010-09-13 21:14
    Humanoido,

    It is not difficult at all to achieve 160 MIPS with the right application.

    I have a real-world application running on eight cogs that executes billions of full-speed instructions between each RdLong, WrLong, or Lock access. Since it uses one of Bill's super deluxe crystals, the chip is actually doing 200 MIPS.
  • Humanoido Posts: 5,770
    edited 2010-09-13 22:19
    Hi Heater, thanks for creating & showing this code, detailing how it works and explaining the way it can be measured to illustrate a significant point. How can I say it - it's remarkable that quoted "theoretical" speed specs of the Propeller chip can actually be achieved in the real world.

    Humanoido
  • Humanoido Posts: 5,770
    edited 2010-09-13 22:29
    Hello K2,

    You make a good point. Do you have a Forum post about your project? I usually swap between the 5 and 6.25 MHz crystals to go from an 80 to a 100 MHz clock. I would stick with the 6.25 but many software examples require the 5. So it's good thinking that Parallax put a tiny socket on their Demo and Proto boards. Outside of that, I use solderless breadboards. The breadboards work fine even when overclocking. I began working with some printed circuit boards that match large breadboards to more easily transfer designs from breadboard.

    Humanoido
  • Heater. Posts: 21,230
    edited 2010-09-13 23:23
    OK, here is an example of not hitting the theoretical peak:

    Here all we want to do is add numbers together in HUB.
         mov   dira,  #1        'For frequency counter output
    
    ' Continuously read LONGS a and b from HUB,
    ' add them and write result to r
    loop rdlong a, a_hub_addr   ' 18 clocks (must wait HUB max - 4)
         rdlong b, b_hub_addr   ' 22 clocks (must wait for HUB max)
         add    a, b            ' 4 clocks
         xor    outa, #1        ' 4 clocks, Toggle output
         wrlong a, r_hub_addr   ' 14 clocks (? wait HUB max - 8)
         jmp    #loop           ' 4 clocks
    
    Here the data we want to work on is in HUB, as is mostly true in the real world. The important operation here is just ADD. What performance are we getting now?

    I have trouble working this out but:

    When the loop has been around once that first rdlong takes 18 clocks, that is one HUB cycle (22) minus the time used in the JMP (4)

    The second rdlong takes a full HUB cycle from that last HUB access, 22 clocks.

    The wrlong takes one HUB cycle (22) minus the time spent on add and xor (8), so 14 clocks. By the way this demonstrates why it's important to interleave 2 or 3 "normal" instructions between HUB operations, they make good use of the HUB access waiting time.

    What happens when we first hit the loop?

    No idea, that rdlong can take any random time from 7 to 22 clocks depending on the HUB phase.

    Total is 66 clocks per loop, an average of 11 clocks per instruction or about 7.3 MIPS. Less than half the theoretical peak. Note that we are only performing the ADD operation we want at 1.2 million per second, i.e. 16 times slower than we can do ADDs on COG data.

    This is why I constantly drone on about the difference between "theoretical" and "actual" performance. And hence the unsuitability of calling anything built with Propellers a "super computer".

    N.B. Those more experienced might want to correct my timing analysis.

    N.B. Yes, I know this is a terribly contrived example. It's intended to show the other end of the spectrum of performance on real world data.
  • jazzed Posts: 11,803
    edited 2010-09-14 07:21
    @Heater, Perhaps you misunderstand hub timing a little or just need some coffee. We've all been there :)

    HUB operations occur on a fixed schedule.

    A COG's HUB operation execution is like taking a train to Berlin which leaves the station every half hour. If you make the train on time, you get to ride it then. If you miss the train by even a few seconds, you have to wait a half hour for the next train. Now, there are other trains that leave the station during that half hour, but they are not for you because they are not going to Berlin. Other passengers may ride the other trains to Brussels, etc..., but they are subject to the same half hour schedule restriction. Only one train can be in the station at any given time so everyone has to wait their turn.
         mov   dira,  #1        'For frequency counter output
    
    ' Continuously read LONGS a and b from HUB,
    ' add them and write result to r
    loop rdlong a, a_hub_addr   ' Clock N = point where this instruction is done
         rdlong b, b_hub_addr   ' N+16
         add    a, b            ' N+20
         xor    outa, #1        ' N+24
         wrlong a, r_hub_addr   ' N+32
         jmp    #loop           ' N+36 ... loop rdlong ends at N+48
    

    I believe the above timing is *practically* correct. Some hub instructions may actually end a clock cycle earlier, but that tidbit is *practically* irrelevant in this case.
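The fixed-schedule idea can be played with in a small Python model. This is only a sketch under stated assumptions: the hub window for a cog comes round every 16 clocks (as the later posts in this thread say), a hub instruction needs at least 8 clocks before it can complete (an assumed figure, picked because it reproduces the N+16 ... N+48 timings above), and ordinary instructions take 4 clocks. All names are hypothetical.

```python
HUB_PERIOD = 16   # clocks between hub windows for one cog
HUB_MIN = 8       # ASSUMED minimum cost of a hub instruction, in clocks
NORMAL = 4        # clocks for an ordinary instruction

def finish_time(start, is_hub):
    """Clock at which an instruction that starts at `start` completes."""
    if not is_hub:
        return start + NORMAL
    earliest = start + HUB_MIN
    # Round up to the next hub window (a multiple of HUB_PERIOD).
    return -(-earliest // HUB_PERIOD) * HUB_PERIOD

def run_loop(instrs, start=0):
    """Return (name, completion clock) for each (name, is_hub) instruction."""
    t, out = start, []
    for name, is_hub in instrs:
        t = finish_time(t, is_hub)
        out.append((name, t))
    return out

# Clock 0 is "N": the point where the first rdlong has just finished.
loop = [("rdlong b", True), ("add", False), ("xor", False),
        ("wrlong", True), ("jmp", False), ("rdlong a", True)]
for name, t in run_loop(loop):
    print(f"{name:9s} done at N+{t}")
```

Under these assumptions, two ordinary instructions fit between hub accesses without missing a window; putting a third one between `rdlong b` and `wrlong` pushes the write out to the following window.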
  • K2 Posts: 693
    edited 2010-09-14 07:34
    I think my programming background is different from most folks here. Almost without exception, my experience has been in real-time embedded control, and most of my applications have included a substantial DSP content.

    For years and years I was compelled to count every clock cycle of every loop and branch. I didn't use interrupts because there was no time for such an extravagance. Besides, the lack of determinism would have been intolerable.

    So when Humanoido makes a big deal of achieving "theoretical" performance with a simple pin toggling routine, I'm puzzled at his amazement. But as Heather has illustrated, just because a processor is achieving theoretical performance doesn't mean it is doing anything useful. (Sorry, Heater, it was just too funny!)

    And that leads to my last point: What is amazing to me are all the applications that succeed in converting a multi-core embedded controller into a general purpose CPU.

    What an application does is ultimately much more important than the theoretical efficiency achieved, unless that application needs every clock cycle.
  • jazzed Posts: 11,803
    edited 2010-09-14 08:18
    The usefulness of any application is relative. Toggling a pin is useful during debug when other tools are not available. Running Wordstar was useful when other editors were not available. Using VI for programming is still preferred by some. An Android or iPhone is better than a laptop in some cases.

    Horses for courses? I know nothing about horses :)
  • ericball Posts: 774
    edited 2010-09-14 08:19
    I'll agree with K2 - MIPS really means "Meaningless Indication of Processor Speed". It may have been more useful back in the day when processors used microcode and instructions required multiple cycles (so which 1 MHz 8-bit processor is faster: 6502, 6809 or Z80?), but in these post-RISC days most processors execute instructions in a single clock cycle, meaning other factors (such as the memory interface) typically have a greater impact on performance than clock speed.

    For example, my AES-128 routines require the following for basic operations:
    Key Expansion ~4.4K cycles, Cipher ~16.1K cycles/block, and Inverse Cipher ~18K cycles/block. I have no idea how these compare to other processors (obviously they are much slower than a dedicated hardware implementation), but I suspect fairly well as AES doesn't require multiplication, division or floating point (i.e. stuff the Propeller doesn't have).

    And although the Propeller has no floating point hardware, the same goes double for MFLOPS. Historically algorithms were compared based on the number of floating point operations required, ignoring any required integer or data movement operations. But since the number of clock cycles required to perform floating point operations has decreased (often to single cycles) the time required to perform the integer operations and data movement can no longer be ignored.

    Therefore rather than think in terms of MIPS or MFLOPS or even MHz, the real question is how much time is required to get the desired work done. And often that question is dependent not only on the processor, but the algorithm and the language / optimization.

    Many years ago I did a software implementation of Reed-Solomon ECC coding for a video data stream. (The actual implementation was in hardware using dedicated custom chips. My code was to create test data.) The original C code required several minutes per frame and was judged too slow. I did some profiling and optimized several hotspots dropping the time per frame to around a minute. This was still too slow so I hand coded the innermost routine in 8086 - dropping the time to less than 30 seconds per frame.
  • Heater. Posts: 21,230
    edited 2010-09-14 10:54
    @Heater, Perhaps you misunderstand hub timing a little or just need some coffee. We've all been there :)

    Nah, I think I understand the HUB well enough. However as you see my translation into numbers is fluent gibberish.

    This morning my caffeine deprived brain was thrown off course by peeking in the Prop manual and finding all that 7 to 22 clocks business.

    So without looking at your numbers I'll have another go:

    A sequence of continuous rdxxxx/wrxxxx will execute at the rate of one every HUB revolution, 16 clocks.
    The code here does not have enough "normal" instructions between HUB accesses to cause them to miss a HUB cycle.
    Therefore execution timing is determined solely by the number of HUB accesses in the loop. So 3 * 16 is 48 clocks for the loop.
    loop
         rdlong a, a_hub_addr
                               '<--- T = 0  on entry,
                               '         48 after 1st time through etc.
         rdlong b, b_hub_addr
                               '<--- T = 16
         add    a, b
         xor    outa, #1
         wrlong a, r_hub_addr
                               '<--- T = 32
         jmp    #loop
    

    Oh look, you have the same result!

    OK, now we can rewrite my conclusion:

    Total is 48 clocks per loop, an average of 8 clocks per instruction or 10 MIPS. Half the theoretical peak. Note that we are only performing the ADD operation we want at 1.7 million per second, i.e. about 12 times slower than we can do ADDs on COG data.

    Hmm... doesn't change much does it?
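The revised conclusion is plain arithmetic, sketched here in Python for anyone who wants to try other loop lengths. It assumes an 80 MHz clock and the 48-clock, 6-instruction loop from this post; `loop_throughput` is a made-up helper name.

```python
CLOCK_HZ = 80_000_000  # assumed system clock

def loop_throughput(clocks_per_loop, instrs_per_loop, useful_ops_per_loop=1):
    """Return (average clocks per instruction, MIPS, useful ops per second)."""
    avg_clocks = clocks_per_loop / instrs_per_loop
    mips = CLOCK_HZ / avg_clocks / 1e6
    useful_per_sec = CLOCK_HZ * useful_ops_per_loop / clocks_per_loop
    return avg_clocks, mips, useful_per_sec

avg, mips, adds_per_sec = loop_throughput(48, 6)
cog_only_adds = CLOCK_HZ / 4   # back-to-back ADDs on COG data, 4 clocks each
print(f"{avg:.0f} clocks/instr, {mips:.0f} MIPS, "
      f"{adds_per_sec / 1e6:.2f} M ADDs/s, "
      f"{cog_only_adds / adds_per_sec:.0f}x slower than COG-only ADDs")
```

Feeding in the earlier 66-clock estimate instead (`loop_throughput(66, 6)`) gives the roughly 7.3 MIPS and 1.2 million ADDs per second quoted in the first attempt.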
  • Humanoido Posts: 5,770
    edited 2010-09-14 11:10
    K2 said, So when Humanoido makes a big deal of achieving "theoretical" performance with a simple pin toggling routine, I'm puzzled at his amazement. But as Heather has illustrated, just because a processor is achieving theoretical performance doesn't mean it is doing anything useful
    Hi K2, if Heather said that, then she forgot to quote simple logic - the processor may achieve something useful as well.

    Have you seen any program on any other chip running in the real world that exactly matches the theoretical speed quote on their manufacturing spec sheet? I have not.

    Though toggling a pin is not useful to you, it does not mean it is not useful to others. There are many things we can do with pin toggling. An entire book could be written about it. Well, me too, when I first started LED blinking seemed relatively benign. Now I know it's the most useful, simple, affordable and powerful multi-purposed tool in the microprocessor world, for testing, experiments, debugging, analysis, display of numbers, determining timing, program function, menu, panel indicators, PWM output, signal tracer, voltage scale, ammeter, sensor, confirmation of inputs, power supply reckoning, serial device, interface tester, bit banger, code dev, and a temp substitute for a $2,000.00 oscilloscope to quick measure some timing, and the list goes on and on.

    The point, proof of the pudding, was never in the pin toggling anyway. There is a higher reason which is explained precisely in the first post in this thread. (below)
    Humanoido said, We know that assembly can run faster than Spin and some assembly operations are faster than others. Using a single Propeller chip, how fast can you make it run, using a simple program, a single statement, hardware wiring, a technique, confirmed either with a software or hardware measurement, i.e. how close can you approach the theoretical specs quoted by Parallax (160 MIPs, 80 MHz) in the real world?
  • Heater. Posts: 21,230
    edited 2010-09-14 11:22
    You know, there is a Heather on this forum, she has yet to make her first post.
  • Heater. Posts: 21,230
    edited 2010-09-14 11:34
    Humanoido,
    Have you seen any program on any other chip running in the real world that exactly matches the theoretical speed quote on their manufacturing spec sheet?
    Almost. The Intel 860 was spec'ed at 66 MFLOPS when it was launched. It could work on the Fast Fourier Transform at that peak rate, all the integer loop housekeeping being done in parallel with the float ops.

    The joke was that Intel did not know how to program their own chip to do that. Some other company had worked it out and Intel was not allowed to show us the code:)

    Now we have seen little code snippets for the Prop that demonstrate both ends of the performance range you can expect.

    They are both true!

    How they apply to the application one has in mind is something one has to work out for oneself.
    ...when I first started LED blinking seemed relatively benign. Now I know it's the most useful, simple, affordable and powerful multi-purposed tool in the microprocessor world...
    Quite so, makes all kinds of things possible when working on a shoestring budget with a little imagination.
  • Miner_with_a_PIC Posts: 123
    edited 2010-09-14 12:05
    20 MIPs per cog @ 80 MHz is the short answer but attaining this tends toward being impractical in most real world applications.

    The closer you edge toward this theoretical limit for a given project, the more calories have to be spent optimizing the code line by line, and all the while the code becomes increasingly specialized to the job at hand. In the end it's a balance: ease of coding and versatility vs. speed. Where you decide to draw the line depends on the requirements of your specific application and how much time you want to dedicate.
  • K2 Posts: 693
    edited 2010-09-14 12:57
    Rarely have I had a more difficult time communicating a more simple concept:

    1) Some real-world tasks, by their very nature, keep the Prop ticking along at 160 MIPS.

    2) Other real-world tasks would be very difficult (or impossible) to code in such a way that every cog is kept busy every moment.

    Maxing-out a processor depends far more on the application than it does on the processor or the programmer. With the right application, it's utterly simple to max-out a Propeller. And it's utterly simple to prove that the Propeller is maxed out.

    I knew exactly how fast a particular algorithm would run on the Propeller before I ever ran it. I predicted the results would appear in 28 minutes and 30 seconds. They appeared in 28 minutes and 28 seconds. The difference (of two seconds) is explained by the fact that the crystal in my watch and the crystal I wired to the Prop are neither perfect nor matched. If I'd somehow used the same crystal to clock both devices, the results would have matched the prediction exactly.

    Microcontrollers like the Prop are deterministic. The more assembly code you write and test, Humanoido, the better you will understand how they work, and how fundamentally simple they are. It is in connecting to the external world, and specifically in WAITING for the external world, that slowdowns and unpredictabilities arise.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2010-09-14 14:46
    Actually, the top end is more than 20 MIPS per cog if you include what the counters are capable of. Not only can you toggle a pin using no instructions at all, but some have even used the phsx registers as auto-incrementing pointers, thus saving an increment instruction and creating "virtual MIPS bursts" beyond 20.

    "MIPS", without further qualification, is an entirely meaningless concept. Total throughput relies on much more than instruction execution rate. In many applications, a Prop cog running at 20 MIPs will outperform an SX at 75 MIPs; in other apps, the SX will be faster. A lot depends on whether 32-bit computations are necessary or not. Another factor is the number and kind of addressing modes available. The Prop is a two-address machine. With full register-to-register operations, it can accomplish more in one instruction (all else considered) than a one-address machine, like the SX, that relies on an accumulator.

    -Phil
  • jazzed Posts: 11,803
    edited 2010-09-14 21:45
    Heater. wrote: »
    This morning my caffeine deprived brain was thrown off course by peeking in the Prop manual and finding all that 7 to 22 clocks business.
    Being coffee deprived is a withdrawal symptom :) I've been through the entire withdrawal before and it is painful ... it is of course just a plot for Juan Valdez to take over the world! :)
  • Humanoido Posts: 5,770
    edited 2010-09-14 23:47
    Actually, the top end is more than 20 MIPS per cog if you include what the counters are capable of. Not only can you toggle a pin using no instructions at all, but some have even used the phsx registers as auto-incrementing pointers, thus saving an increment instruction and creating "virtual MIPS bursts" beyond 20.

    Phil- can you elaborate on how to toggle a pin with no programming instructions?

    And how to go beyond 20 MHz barrier using phsx? Are you using mirrors to trigger the event horizon?

    Humanoido
    Propeller Manual v1.1 · Page 95: Each of the two counter modules can control or monitor up to two I/O pins and perform conditional 32-bit accumulation of the value in the FRQx register into the PHSx register on every clock cycle. Each Counter Module has its own phase-locked loop (PLLx) which can be used to synthesize frequencies from 64 MHz to 128 MHz.
  • Heater. Posts: 21,230
    edited 2010-09-15 01:11
    Humanoido, you should take a look at the Application Note on counters from Parallax. There you will see examples of how to set up counters to generate different frequencies output directly to pins, or PWM signals, or use them as frequency counters.

    The first of Phil's proposals is that you can set up a counter to generate the same signal as my little PASM code loop. It will sit there doing that whilst your code then runs something else. Therefore that counter is worth 20 MIPS of processing and you now have an effective 40 MIPS COG :)
    And how to go beyond 20 MHz barrier using phsx
    That's a bit harder, phsx is basically a counter that can be incremented by the timer hardware. If you can use the value of phsx as a pointer into COG memory, you now have a pointer that is incremented by the counter and not by your code. Effectively the timers are performing instructions for you, in parallel, and so you have exceeded 20MIPS.

    Sorry I don't have an example of this.
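To make the counter idea a little more concrete, here is a hedged Python sketch of the phase accumulator described in the manual quote above: PHSx gains FRQx on every clock. It additionally assumes the NCO-style counter mode from the application note, in which bit 31 of PHSx drives the pin, so the output frequency is FRQx × clock / 2^32. The function names are invented for illustration; this is a model, not Propeller code.

```python
def nco_frq(target_hz, clock_hz=80_000_000):
    """FRQx value that makes PHSx bit 31 cycle at roughly target_hz."""
    return round(target_hz * 2**32 / clock_hz)

def nco_output_hz(frq, clock_hz=80_000_000):
    """Output frequency produced by a given FRQx value."""
    return frq * clock_hz / 2**32

def bit31_trace(frq, clocks):
    """Simulate PHSx += FRQx each clock; return bit 31 after every clock."""
    phs, out = 0, []
    for _ in range(clocks):
        phs = (phs + frq) & 0xFFFFFFFF   # 32-bit wrap-around
        out.append(phs >> 31)
    return out

frq = nco_frq(5_000_000)                 # same 5 MHz as the xor/jmp loop
print(hex(frq), nco_output_hz(frq))
print(bit31_trace(frq, 16))
```

With an 80 MHz clock this gives an FRQx of $1000_0000, and the simulated bit 31 changes state every 8 clocks, the same waveform as the two-instruction PASM loop but with no instructions spent on it.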
  • BradC Posts: 2,601
    edited 2010-09-15 02:29
    Heater. wrote: »
    If you can use the value of phsx as a pointer into COG memory, you now have a pointer that is incremented by the counter and not by your code.

    It's just as useful as a pointer into HUB memory. Some might say more so ;)
  • Humanoido Posts: 5,770
    edited 2010-09-15 02:54
    Heater. wrote: »
    Humanoido, you should take a look at the Application Note on counters from Parallax. There you will see examples of how to set up counters to generate different frequencies output directly to pins, or PWM signals, or use them as frequency counters.

    The first of Phil's proposals is that you can set up a counter to generate the same signal as my little PASM code loop. It will sit there doing that whilst your code then runs something else. Therefore that counter is worth 20 MIPS of processing and you now have an effective 40 MIPS COG :)

    That's a bit harder, phsx is basically a counter that can be incremented by the timer hardware. If you can use the value of phsx as a pointer into COG memory, you now have a pointer that is incremented by the counter and not by your code. Effectively the timers are performing instructions for you, in parallel, and so you have exceeded 20MIPS. Sorry I don't have an example of this.
    Hi Heater, I studied the app note AN001 - Propeller Counters (v1.2; 364 KB). With so many logic modes, my first thought was that it would be possible to build up a tiny processor, software-wired with Propeller counters and software states. You've got logic, clock, and inspiration.

    There is a condition in optics called constructive interference that can add to the light signal through waveform phasing. I'm thinking if there's an analogy with Propeller counters, and if those counter modes can be configured and run in tandem with the cog processing, we could build up a sub processor or coprocessor enhancement. As pointed out, it could make the Propeller chip exceed original specs in some ways.

    It doesn't always take a big program to do something useful. Here we see something useful with no instructions. They say that tiny bugs inherit the world in part because of their sheer numbers, tiny size, and lack of a more complex biological structure. Perhaps a lot of tiny modes happening together can swarm into a greater purpose.

    Humanoido
  • User Name Posts: 1,451
    edited 2010-09-15 14:22
    I'm definitely not capturing the vision of this. Counters are a dime a dozen - almost as cheap as words.

    Little bugs have sensors, logic, memory, actuation, feedback, etc.

    While you consider the role of some ersatz programming gimmick, you have how many fully-functional 32-bit CPUs sitting idle awaiting something useful to do?
  • Humanoido Posts: 5,770
    edited 2010-09-15 14:47
    User Name said, I'm definitely not capturing the vision of this.

    What is it that you don't understand?

    Counters are a dime a dozen - almost as cheap as words.

    What does the cost of a counter have to do with anything? BTW, what are you paying for yours?

    Little bugs have sensors, logic, memory, actuation, feedback, etc.


    Indeed they do. Your point?

    While you consider the role of some ersatz programming gimmick,

    There is no consideration of any duplicate gimmick. We are developing techniques and algorithms for a greater cause. You would know if you read the introductory post.

    you have how many fully-functional 32-bit CPUs sitting idle awaiting something useful to do?


    I have no idle CPUs. Are you on something?
  • Heater. Posts: 21,230
    edited 2010-09-15 14:57
    User Name,
    I'm definitely not capturing the vision of this. Counters are a dime a dozen
    The topic of this thread is "How fast Can you make your Propeller?"

    Well, you can write PASM code in COGs and attain whatever speed it is. But what if the counters can somehow be used in conjunction with your PASM to enhance the speed of whatever you are doing?

    As has been suggested they can be used as auto incrementing pointers to speed up array traversal.

    I think a classic example of this is Phil Pilgrim's radio receiver built with just a Prop and a couple of resistors/capacitors.

    For sure you could implement the algorithm he uses to do signal detection in PASM alone but it would be too slow to be of use. Drag the counters in to help and you have effectively multiplied the speed of the thing by a factor of 10 or whatever.
  • Hanno Posts: 1,130
    edited 2010-09-15 15:12
    When you use QuickSample in 4 cog mode your Prop will sample the 32 bits of the IO port every clock cycle. When clocked at 100 MHz, your Prop will read 32 bits * 100 Msps = 3.2 Gbit/second. The PropScope uses the video generator in one cog to continually drive the function generator. It outputs 8 bits at 25 Msps, good for 200 Mbit/second. The PropCVCapture object of ViewPort captures NTSC video to hub: it looks for h/v syncs and then reads and encodes 8 pixel measurements into a long, I believe a pixel every 4 instructions... Finally, the Conduit used in ViewPort/PropScope/12Blocks transfers data full duplex over serial at 2 Mbps. Put them all together and your Prop gets a wee bit warm :)
    Hanno
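The bandwidth figures in this post are simple products, sketched below in Python. The second half is only a guess at how four cogs could cover every clock: it assumes each cog samples once per 4-clock instruction and that the cogs are staggered one clock apart, which the post does not spell out. All names are hypothetical.

```python
def mbit_per_s(bits_per_sample, msamples_per_s):
    """Raw data rate in Mbit/s for a given sample width and rate."""
    return bits_per_sample * msamples_per_s

print(mbit_per_s(32, 100))   # 32-bit port at 100 Msps, in Mbit/s
print(mbit_per_s(8, 25))     # 8-bit output at 25 Msps, in Mbit/s

def interleaved_sample_clocks(cogs=4, clocks_per_sample=4, samples_per_cog=3):
    """Clocks on which samples land if cog c starts c clocks late (ASSUMED)."""
    return sorted(c + clocks_per_sample * k
                  for c in range(cogs)
                  for k in range(samples_per_cog))

# With 4 staggered cogs every clock from 0 upward is covered exactly once.
print(interleaved_sample_clocks())
```

The same helper shows why fewer cogs leave gaps: with two cogs, only half of the clocks get a sample.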
  • User Name Posts: 1,451
    edited 2010-09-15 17:14
    I stand corrected. Phil's radio example and Hanno's PropScope are two pretty impressive and concrete examples of getting more out of the Prop than it would seem capable of, at first glance.

    "Are you on something?" I'm usually on an R1150R, but not at the moment.
  • Humanoido Posts: 5,770
    edited 2010-09-16 05:50
    User Name wrote: »
    I stand corrected. Phil's radio example and Hanno's PropScope are two pretty impressive and concrete examples of getting more out of the Prop than it would seem capable of, at first glance.
    "Are you on something?" I'm usually on an R1150R, but not at the moment.
    Sounds like you have a choice of cycles, Prop or BMW... :)

    Actually there are projects completed and various methods that are not in completed projects. At a glance, the latter could easily be missed. There are several threads exploring those capabilities and creating new cutting edge technology to make the Prop more like "Extended Prop." Stay tuned for more..

    Humanoido
  • LoopyByteloose Posts: 12,537
    edited 2010-09-17 03:29
    It seems to me something is very awkward with the math.

    The 'obsolete' SX chips still may run at 100 MIPS on one processor (that's a 100 MHz clock and one instruction per clock, though admittedly that is 8 bits and not 32 bits).

    Compared to the Propeller at 80 MHz, you still get 80 MIPS per processor out of the SX versus 20 MIPS from the Propeller. That seems slower. I believe that you can easily get a PIC to clock at 20 MHz and produce 20 MIPS.

    About the only feature that may be faster on the Propeller are those A and B Counters on the cogs. I believe they clock at 80 MHz.

    It would seem that some BasicStamps are actually faster than the Propeller on a processor to processor basis. I am wondering if and when the Propeller will acquire one instruction per clock (The 8051s take 12 and multiples of 12!)
  • Heater. Posts: 21,230
    edited 2010-09-17 03:48
    Loopy Byteloose
    8 bits and not 32 bits though

    Well, that's no small thing. If you want to do 32-bit maths on an SX you have eaten up all those extra MIPS. Same for the PIC. You probably don't even win comparing 16-bit maths on the SX vs 32-bit on the Prop.

    Comparing 8-bit MIPS to 32-bit MIPS is just not on.

    Now quite why the Prop only runs at 20 MIPS per cog in this day and age is another question. For sure big silicon is slower than small silicon and the Prop must be a lot bigger than an SX.
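The 8-bit versus 32-bit point can be illustrated with a small Python model. This is not SX or PIC code, just a sketch of why one 32-bit `add` on the Prop corresponds to at least four byte-wide add-with-carry steps on an 8-bit core. The step count here is a lower bound: it ignores the extra moves an accumulator machine needs to fetch and store each byte. The function name is made up.

```python
def add32_on_8bit(a, b):
    """Add two 32-bit values one byte at a time, propagating the carry.

    Returns (32-bit result, number of 8-bit add steps used).
    """
    result, carry, steps = 0, 0, 0
    for i in range(4):                        # four bytes, low to high
        byte_sum = ((a >> (8 * i)) & 0xFF) + ((b >> (8 * i)) & 0xFF) + carry
        result |= (byte_sum & 0xFF) << (8 * i)
        carry = byte_sum >> 8                 # carry into the next byte
        steps += 1
    return result, steps

total, steps = add32_on_8bit(0x0001FFFF, 0x00000001)
print(hex(total), steps)
```

On that rough basis an 80 MIPS 8-bit core doing nothing but 32-bit additions lands at 20 million additions per second or fewer, which is the same ballpark as one Prop cog.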
  • Leon Posts: 7,620
    edited 2010-09-17 03:56
    The newer 8051s like those from Silicon Labs execute most instructions in one or two clocks (up to 100 MIPS).