50 MIPS LMM on Prop1+
Ariba
Posts: 2,690
I did a lot of studies on LMM for the Prop2 before the last shuttle run. At that time there was no hubexec, only quad-wide hub read/write, so it was quite similar to today's P1+ if Chip does not implement hubexec.
I will show here a possible way to do Quad-LMM that works like hubexec, only a bit slower on jumps/calls. It needs only the RDQUAD and RDLONG instructions, no cached version of RDLONG.
A simple QuadLMM looks like this:
QLMM    rdquad  ins1,ptra++     'read 4 longs to ins1..ins4
        nop                     'spacer needed?
ins1    nop                     '<- must this be quad aligned?
ins2    nop
ins3    nop
ins4    nop
        jmp     #QLMM           'a REP can speed it up a bit

'This executes 4 instructions in 16 sysclock cycles (best case).
'At 200MHz this is 200/16*4 = 50 MIPS.
But you can execute only whole 4-instruction packets and jump only to quad-aligned addresses, which is complicated for the compiler and is not very memory-efficient (a lot of NOPs have to be included).
To improve that, we must be able to load a quad and jump to an instruction inside the quad on an FJMP. We also need helper cog routines that do the FJMP, FCALL and FRET. Here is my solution; I show only the FJMP helper routines for now to keep the code from getting too big:
reset   mov     pc,codestart    'all fjumps/calls use the byte address

FJMP    setptra pc              'set quad addr
        shr     pc,#2           'calc which instr inside quad
        and     pc,#3
        add     pc,#ins1
        rdquad  ins1,ptra++     'load quad
        jmp     pc              'jump to instr pos in quad

QLMM    rep     #511,#6
        rdquad  ins1,ptra++     'load next quad for linear code
        nop
ins1    nop
ins2    nop
ins3    nop
ins4    nop
        jmp     #QLMM           'only executed in case of FJMP or >511 instr

'an FJMP in LMM code is done with a jump to the helper routine
'and a long for the jump address:
        jmp     #_fjmp1
        long    @address

'it's most efficient if we have 4 fjump helper routines and the compiler sets a
'jump to _fjmp1 on ins1, to _fjmp2 on ins2 and so on.
_fjmp1  mov     pc,ins2         'ins2 holds the long with the jump address
        jmp     #FJMP
_fjmp2  mov     pc,ins3
        jmp     #FJMP
_fjmp3  mov     pc,ins4
        jmp     #FJMP
_fjmp4  rdlong  pc,ptra         'for a jump in ins4 we need to read the addr from hub
        jmp     #FJMP
For _fjmp1..3 it takes 10 instructions to do the jump, one of them being the rdquad, which has to wait for the hub. _fjmp4 takes a bit longer because it needs another hub read. So it is 20..28 sysclock cycles, which is 5..6 times slower than normal instructions.
With hardware hubexec a jump will also need to load the right quad, so it can also take up to 16 cycles.
For sure this is all untested yet; we don't even have an FPGA image. If it works, it may make hubexec unnecessary and we could remove about 15 instructions. In the next post I will show the FCALL and FRET.
Andy
Comments
jmp #_fcall1
long @address
The fcall helper routines:
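A possible sketch, following the same pattern as the _fjmp helpers and assuming a single software link register retaddr (hypothetical; the actual routines may differ):

'hypothetical sketch (not the original listing): fcall helpers with a software link register
_fcall1 mov     retaddr,ptra    'ptra already points past the current quad
        sub     retaddr,#8      'return address = ins3, the long after the call address
        mov     pc,ins2         'ins2 holds the long with the call address
        jmp     #FJMP
_fcall2 mov     retaddr,ptra
        sub     retaddr,#4      'return address = ins4
        mov     pc,ins3
        jmp     #FJMP
_fcall3 mov     retaddr,ptra    'return address = first long of the next quad
        mov     pc,ins4
        jmp     #FJMP
_fcall4 rdlong  pc,ptra         'the call address long sits in the next quad
        mov     retaddr,ptra
        add     retaddr,#4      'return address = the long after it
        jmp     #FJMP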
The return from an LMM subroutine always works the same, no matter which of ins1..4 it sits on:
jmp #_fret
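With the hypothetical retaddr register from the sketch above, the matching return helper could simply be:

'hypothetical sketch (not the original listing)
_fret   mov     pc,retaddr      'restore the saved byte address
        jmp     #FJMP           'reload that quad and continue there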
Loading a long constant into a register in QLMM code can be done with 2 instructions:
augs #32bits
mov reg,#32bits
but this works only if both are inside the same quad. So the compiler should check whether such a long load would start on ins4 and insert a NOP in that case.
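A hypothetical layout (not from the post): if the AUGS would otherwise land on ins4, the compiler pads with a NOP so that the AUGS/MOV pair ends up together in the next quad:

'hypothetical compiler output
        nop                     'ins4 of this quad: padding inserted by the compiler
        augs    #bigconst       'ins1 of the next quad: upper bits of the constant
        mov     reg,#bigconst   'ins2 of the next quad: lower 9 bits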
Andy
I'll have to study your examples in more detail, but basically I agree. I have said many times that we never really needed Hubexec, but multi-long hub reads would be nice. The rest can be done in software.
Of course, I'm fine with Hubexec if we can get it for nothing, but we don't. And I would not be willing to pay any cost - either in time to market, cost, power consumption, or cog speed - just to get Hubexec.
The main problem seems to be that some people want the P16X32B to be an ARM-killer. Which of course it never could be.
Ross.
Nice concept. I didn't look too closely though.
Unfortunately, the RDQUAD has to be quad aligned.
I think it's a question of what hubexec means to people which is why I always try to ask about the numbers not how the chip achieves them. What the compiler outputs doesn't matter to me one bit; how fast the chip will run it does.
Agreed, BUT it does need to have some USP to get it recognised in an increasingly crowded market. I must get 3 or 4 new chip announcements daily; probably three-quarters of them are ARM-based, but there are people still going their own way and they seem to be doing so successfully.
Datasheet front pages, and increasingly parametric selection pages, are the way many, many chips get chosen, so numbers that fall under the right headings do matter.
This is not a big deal to a compiler.
Ross.
I agree how fast the chip runs is critical, but for the Propeller, it is the speed of COG execution that matters. The speed of HUB execution is far less relevant.
Remember how slow SPIN was on the P1? (and I have to point out that SPIN was slower than HUB execution!) Did it really matter? No! - because COG execution was fast enough to compensate. And you had eight of them!
Ross.
It seems to me that ARM can be described as commoditized compute power, mixed with interface, made available in a continuous spectrum of performance and price. It also seems to be displacing every other architecture. Maybe it's doing this because it and everything it's displacing are more or less the same, all just means to an end. ARM plays the McDonald's of the microcontroller world, offering the most consistent experience, perhaps. I picture people in business attire sitting in McDonald's drinking coffee early, obliged to go through the rest of their day without much deviation. They don't relish the thought of that Big Mac they'll be eating for lunch later. It's all perfunctory.
This LMM discussion is good, by the way.
I.e., if PC > COGRAM, then it reads a quad (into $1f0 or wherever) from PC & $3fff0 (if it's not already cached into shadow memory, which is just one quad), then executes from $1f0 + ((PC & $0c) >> 2), which should auto-update after the last instruction is read?
If you see what I mean
That way, it always executes from COGRAM and doesn't need the extra instructions around it, and jumps would work directly without any messing around.
That's pretty much exactly how hub exec will work. When we see a $200+ address, we force instruction reads from $1FC..$1FF (four longs) if the cache address matches. Otherwise, we wait for the next hub cycle and read the four longs from hub into $1FC..$1FF so that we can execute from them.
That statement makes me sad. The ARM was first designed, in 1983, by a couple of guys in a small company that most of the world had never heard of. A couple of guys who thought, "F'it, we are not going to use a Motorola 68000 or Intel whatever, we can design our own CPU for our next computer". And, amazingly, they did. Does that sound familiar?
By happenstance, the computer they built did not catch on, despite being faster than any other desktop machine at the time, but that processor design happened to offer what people wanted in other applications: low power, low price, small size, reasonable performance. At a time when Intel and friends were racing toward bigger, faster, hotter all the time.
ARM designs still come from a handful of guys from a small company most of the world has never heard of.
Had a feeling it'd be something like that
Ahem... http://en.wikipedia.org/wiki/Sophie_Wilson
Didn't mean to make you sad, Heater. I just see ARMs taking over and I can see a future where engineers will be told to use them by management because... they are ARMs and you use ARMs to build embedded systems, just like everybody else does.
It took a rather long time before that became true. The market had to come into existence. Fancy graphics and more than a couple of kilobytes on portable devices didn't just blossom because the ARM was there. But there's no doubt ARM fitted right on in when the time came. One of the keys was to stick around long enough for it to happen.
To be fair, she was a man at the time...
Thing is, people don't use ARM chips. They use chips that have ARM cores in them, made by Atmel, ST, NXP, etc. They license the ARM core because it's cheap and effective, and pair it with their hardware to make an embedded chip. Others make them into full-on CPUs (often with multiple ARM cores).
You know, Parallax could consider licensing the Propeller "core" to those same chip companies to make chips with.... might work out in the long run.
You've now got more and faster Cogs, which should alleviate a lot of concern from prospective new customers about running out of Cogs/Cores with needed peripherals. Good.
But, when it comes to mainline code of any size, you're still going to be stuck at Spin2, or C with LMM at 15-20 MIPS(?), IIRC.
At that point, won't most professionals compare that against a standard mid-to-large ARM, AVR or PIC32 and see that they can get that there, plus whatever hardware peripheral mix they want, with more speed? Actually, whether they need that speed or not, it certainly has the potential right there to have them cross it off the list as 'too slow'. Seems like a repeat of the P1, which is not needed.
Didn't LMM come about mostly to correct the Prop's 2K Cog RAM limitation?
I'm not sure it's entirely valid to say that Cog execution was fast enough to compensate for HUB execution in the past, or will likewise do so in the future. It may have been fast enough for the designs/products ultimately produced using the P1; however, that fails to take into account how many products were never realized because of the lack of RAM/speed in the first place.
I think one could equally hypothesize that the limited RAM, and the relatively slow speed of LMM execution, show that it was not perceived to compensate enough for real interest/uptake.
I am quite possibly wrong; however, from a couple of Ken's comments at one of the Expos at Rocklin, I got the impression that the uptake was helpful, but rather less than hoped for.
With a number of years of R&D dollars already sunk, and probably the need for a decent revenue stream to try to recoup that, not to mention fund additional R&D in the future, it seems like Parallax really needs some sort of LMM/hubexec that can give comparable MIPS throughput to get a second look. Or a third look, after a less-than-positive previous one years ago.
I think rushing to manufacture without it will quite likely end up with a P2/P16 sitting out there with high initial interest, and then a lot of posts across forums about how it's a new chip with all of the previously discussed limitations. Not good.
Not looking to argue with anyone, just trying to play devil's advocate from the 30,000-foot view.
Ross is correct. However, what Ross describes does require a certain amount of "handling" to be effective. But it's also not an unreasonable expectation to require people to deal with such limits.
Otherwise, if the programming model is completely open-ended, you tend toward some very bloated coding styles that are just pure waste. And this point has some serious history behind it!
In a lot of different forums, AVR, PIC, etc, ARM is likened to the Borg.
Fact is, it's a success story that Chip and company are trying to emulate in their own way, and it should be looked at and studied for how they did what they did.
I'm sure there were times within ARM where there was similar strife and discord as they tried to make decisions that would change parts of their architecture. Some people probably dismissed talk of change as heretical to The One Way, while others championed it as an avenue to greater future benefit.
Thing is, they made changes when they had to, against the wishes of some, and now they have the world in their pocket and Intel seriously worried.
The top PIC32MX chips run at 100MHz; most run at 50MHz or lower. The higher-end PIC32MZ family gets up to 200MHz, but is that a midrange part?
Most of the ARM-based MCU chips are running at 48MHz or lower, with some getting near 100MHz. It's not until you go to the higher-end parts that you get high perf.
Also, real-world perf is harder to come by than theoretical perf. I'm working on some code for a 48MHz ARM-based Atmel chip, and I'm having to do some heavy optimization to get it to do something that I have done trivially in a single P1 cog. And I'm even using the SERCOM SPI hardware to do part of the task, but it's still not fast enough yet. I can't make them change the chip in the product to be a P1, sadly.
However, it all comes back to the Engineer Parallax is trying to entice.
It's not unreasonable for anyone on this forum; however, for an engineer who, on top of everything else, has to learn everything about the Prop that is new and different from the standard-ISA uCs they already know, I'm not sure it's entirely reasonable. They have deadlines and commit dates, and this is probably an 'experiment' on their part to start with. With all of this, do they then want to expend more time and energy figuring out how to do all this 'handling' as well?
If it's possible for some form of hubexec to come into play which avoids this, then it would seem like it would greatly improve the new Prop's chances.
Code bloat, yes, familiar in the PC world and maybe in the app world.
Not sure I've heard much about it in the uC world, or really seen it used as an argument against uC RAM...
I do see what you mean about a bleak future of a mono-culture in CPUs.
However, I would not lay all the blame on "management". There are powerful network effects going on as well. If there are a lot of people using a device, creating an operating system and drivers for it, fixing each other's bugs and generally getting things to work, then it makes no sense for me to go with something different and have to face all that on my own.
The small company I worked for almost a decade ago did exactly that: we needed more power and space for our embedded systems. The Atmel ARMs fitted the bill, and we knew we could get Linux up and running on there by tapping into the work done by the community of all the other companies that were already using Atmel ARMs around the world.
Still, I'm hopeful that as long as there are independent souls like yourself, Sophie Wilson, David May, Andreas Olofsson and so on around that mono-culture will never totally solidify.
Brian,
Wilson was a guy at the time. Whilst Wilson did the design, it did take the support of the company to do so. Think Chip and Parallax.
Roy,
You are so right. If the Propeller design were licensed to some ARM SoC manufacturer, we could have all the Linux goodness of a Raspberry Pi with the IO and real-time capabilities of the Propeller.
koehler,
LMM came about because Bill Henning made the observation that, because of the fixed 32-bit size of Propeller instructions, the way the HUB worked, and whatever other details, a simple loop of four instructions could execute code from HUB at quite an impressive rate, making it possible to create a compiler that generates such HUB-resident code.
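For reference, the classic P1 LMM kernel is usually sketched roughly like this (a from-memory sketch of the well-known technique, not Bill's exact code):

'sketch of the classic P1 LMM loop
LMM_loop    rdlong  LMM_instr,pc    'fetch the next 32-bit instruction from hub
            add     pc,#4           'advance the hub program counter
LMM_instr   nop                     'the fetched instruction executes here
            jmp     #LMM_loop       'fetch the next one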
I remember when Bill announced this discovery, it was magical.
Hi Ross, I respect your opinion but think you are wrong. I'd have to rewrite your statement as...
"I agree how fast the chip runs is critical, but for the Propeller, the speed of COG execution is the deciding factor once the speed of HUB execution reaches a certain amount."
And sadly 12-15 MIPS is nowhere near that amount.
In most general-purpose microcontroller applications there is a bit of mainline code running and controlling the overall flow of data between the various IO sections, and it's no good having super-smart IO if you can't make controlling decisions in a timely fashion. Trying to shoehorn 'main' into 496 words isn't going to work.
If the core register count went up from 512 to something like 8K, or better yet 16K, then a slower central pool of RAM might work as somewhere to stick variables that need to be shared across cores. But then, such slower memory isn't going to work in video applications.
Maybe, the problem is that the P1+ is trying to be an evolution of the P1, using lessons learned on the P2, when what should be being designed is the P3, a new chip with a new architecture.
This P1+ is going to be like a rocket, and don't forget, yeah, it may run at a now "moderate" 100 MIPS, but there are 16 of those beasties!! Plus it is a microcontroller, not a CPU!
Well, I guess that depends on what you expect to be able to do with it.
For me, the chance of having sixteen cogs available to run parallel C code at 25 MIPS is pretty damned appealing. Far more appealing than running one cog at 200 MIPS.
If I wanted to do that, I wouldn't buy a Propeller - I'd buy something much cheaper and much faster.
Ross.
Looking on Digi-Key, halfway down the page I see an M4 at 84MHz, $3.02, 64K RAM, 256K Flash, etc. Not sure what their average cycles per instruction is.
The AVR I'm looking at is a 32MHz XMEGA384 at $5, 32K RAM, 384K Flash, averaging 1 instruction/cycle, ~32 MIPS.
Agreed, they may not be speed demons; however, someone who knows AVR knows their MHz is going to be very close to MIPS, and someone who knows ARM is going to have a good idea of what the MHz/MIPS ratio is. And since these samples are probably at least 50% of the cost of the new Prop, it obviously has to do better if it costs more.
What are the new Prop's MIPS going to be, or better, how will it compare on what is probably the first item on a prospective customer's checklist?
It just seems to me that if it compares only as well as mid-level uCs, then whether it's easier or not, the fact that it might be 100% more expensive probably means that it will never be ordered for even a look. And all the other goodies, like pin functions, etc., never get found.
I just think, for those reasons, it would really, really be good for Parallax to be able to couple the 512K RAM data point with a mainline-code MIPS number as high as possible.
As for Atmel ARM, even the folks on AVR Freaks seem rather 'meh' with their offerings.
Personally, the LPC810 for $1.37 is a neat 8 pin DIP toy.
If it's possible to get 25 MIPS per core then it might be a runner, but no one seems to have said 'Yes, we can do that.' Ah, the fun of fluid specifications.