I'd say 20-odd instructions is a good guess at the mul size.
I'm not including the size of the mul function in my code count for butterfly() though. It's a different function.
There are probably 2 instructions used for setting up the params to mul and 1 to get the result in the right place. So that only accounts for 9 instructions.
Still a huge gap!
The GCC Prop code does things like this:
rdword r7, r7        ' load a 16-bit word from hub into r7
shl    r7, #16       ' shift it up to the top half...
sar    r7, #16       ' ...and arithmetic-shift it back: a 16-to-32 bit sign extend
Also note that the RISC-V is a 3-operand processor with 5-bit register fields (32 regs), whilst the Prop is natively 2-operand with 9-bit fields (512 regs). The GCC port does not fully use those 9-bit fields; instead it treats the Prop as a 2-operand, 16-register core (effectively a 4-bit field).
Perhaps a more even comparison would be against another 3-operand, 5-bit-field core, e.g. from ARM - one with MUL?
That shl/sar thing is a sign extend from 16 to 32 bits. I got to wondering why it is there. Turns out the sin and cos tables in the fftbench are word wide, so it's rdword/shl/sar.
I changed the tables to 32 bit which gets rid of that.
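In C terms the difference looks roughly like this (a minimal sketch; the names are illustrative, not the actual fftbench identifiers):
#include <stdint.h>

static int16_t sine16[1024];   /* 16-bit table: every access costs rdword + shl #16 + sar #16 */
static int32_t sine32[1024];   /* 32-bit table: a plain rdlong, no sign extension needed      */

int32_t twiddle16(int i) { return sine16[i]; }   /* compiler must sign extend to 32 bits */
int32_t twiddle32(int i) { return sine32[i]; }   /* straight 32-bit load                 */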
But that only gets the Prop butterfly down from 249 instructions to 246.
I just dug out my Linaro ARM cross-compiler:
$ arm-linux-gnueabihf-gcc -std=c99 -Os -o fftbench.arm fftbench.c
That produces a butterflies() of 123 instructions, compared to our previous record of 105 for the RISC V.
Edit: Just found my MIPS cross-compiler for OpenWRT. Result butterflies() = 115 instructions.
Edit: We forgot the x86. I have no 32 bit compiler but I see no reason why that should make much difference. Anyway the result... a monstrous 432 bytes!
Edit: Strike out "monstrous". My apologies to Intel. Half asleep this morning I was confusing bytes with instructions. 432 bytes is of course 108 longs, which is pretty good.
Dang, that RISC V is good!
Objdump outputs attached for compiler nerds.
As I said before I'm not about to suggest Chip scrap the P2 instruction set and swap to RISC V.
So I started staring at the gcc generated code and wondering if there were some simple thing that could be done to the Propeller instruction set to bring code size down to a more ARM/MIPS/RISC V level.
Potatohead has a good point when he says "...a big part of it is having three register operands. Takes two PASM instructions to do some things that can happen in one in RISC V."
There are a lot of sequences in the code like:
mov 14 <r5>, #12
add 14 <r5>, 40 <sp>
Always having to fetch the first operand, then add the second, rather than being able to do it in one instruction.
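The same thing at the C level, purely for illustration (these names are not from fftbench):
/* c = a + b. A 2-operand machine must prime the destination first:
       mov  c, a        ' c = a
       add  c, b        ' c = c + b
   A 3-operand machine (RISC V, ARM, MIPS) does it in one:
       add  c, a, b                                          */
int add3(int a, int b)
{
    int c = a + b;
    return c;
}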
There are a total of 71 mov instructions in there. 30% of the instructions doing no serious work!
I don't see any way around this.
Somehow the hand crafted pasm version (heater_fft) comes in at 105 instructions. Less than half the prop-gcc generated code size.
I'm guessing we can't expect the prop-gcc's optimizer to ever improve to that level.
Is it so that prop-gcc might benefit from having 32 working registers instead of only 16?
Only 39 of a possible 64 basic instructions are used in the P1, so I see no reason why some 3-operand instructions could not be added. The 8 bits of the ZCRI and condition fields, along with 1 bit from the 6-bit instruction field, could be used for the third address. Not sure if the same is true for the P2 since there are additional instructions.
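A rough sketch of that bit budget, assuming the usual P1 field layout (the 3-operand form is purely hypothetical):
#include <stdint.h>

/* Existing P1 cog instruction: 6 + 4 + 4 + 9 + 9 = 32 bits */
struct p1_insn {
    uint32_t src    : 9;   /* source register or #immediate          */
    uint32_t dst    : 9;   /* destination register                   */
    uint32_t cond   : 4;   /* conditional execution field            */
    uint32_t zcri   : 4;   /* write Z, write C, write result, imm    */
    uint32_t opcode : 6;   /* only 39 of 64 encodings used, as above */
};

/* Hypothetical 3-operand form: spend cond + ZCRI + 1 opcode bit on a third
   9-bit register field, 5 + 9 + 9 + 9 = 32 bits, at the cost of conditional
   execution and flag control on those instructions. */
struct p1_insn3 {
    uint32_t src2   : 9;
    uint32_t src1   : 9;
    uint32_t dst    : 9;
    uint32_t opcode : 5;
};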
Somehow the hand crafted pasm version (heater_fft) comes in at 105 instructions. Less than half the prop-gcc generated code size.
...
Is it so that prop-gcc might benefit from having 32 working registers instead of only 16?
Could be worth checking.
How many registers did the hand crafted pasm version (heater_fft) end up using ?
Is it so that prop-gcc might benefit from having 32 working registers instead of only 16?
Especially when not using CMM, why don't they just make it use all otherwise unused registers, including special registers when possible/necessary?
This is one of the main reasons I don't like PropGCC - you get a processor with 512 registers, and then pretend it only has 16. Especially in the case of fcache functions, more space is spent addressing all those hubram variables than is saved by only having 16 registers.
Is GCC (not necessarily PropGCC) not smart enough to understand that a processor can have a variable number of registers or what? Can't they just tell it that the Prop has 512 registers but most of them are usually used for special things (LMM kernel, shadow registers are too spooky, etc.) except when they're not? I would think it would be somewhat similar to how on an x86 (which I don't really know assembly for) you only use vector registers in special cases.
I've been thinking about this and the fast, small PASM programs do a good job of weaving things together.
An optimizer would need to look ahead several steps to come close to what we know can be done.
And I'm sure they already do, but skilled people do it better.
RISC V seems to present an easier case in that flow of data is part of the instruction design. Some nice things like stacks can be created and managed in software, as can how states are handled.
For people, it is nicer to have things like indirect addressing modes, stacks, etc...
We can fit the pieces together and see paths that might not be so easily quantified and generalized too.
That all said, I really like two operands. Seems a sweet spot for readability. But after looking through these outputs, I see three operands as superior.
The last three outputs are very interesting.
Intel is a mess, ranging from a couple bytes up to nine for some of the longer instructions I saw.
We see hot running chips with a lot of complexity from Intel and they are fast. ARM doesn't yet go as fast, but does excel in compute per watt as did MIPS.
I'm thinking about the more general, compact and regularly structured instruction style as the better one for embedded and power sensitive applications for sure.
Deffo interesting discussion. It continues to be worth following the RISC V project efforts.
It would be awesome to make a RISC Prop, P1 style. Most of the necessary boot strapping is done.
Call it the "optimized for C" edition. Then take the P2 "hot" as the optimized-for-people edition.
How many registers did the hand crafted pasm version (heater_fft) end up using ?
No idea, I don't have the code to hand just now. But it's very few. There aren't many variables in that algorithm, and a few temps for calculating with, maybe. It does not matter, the point is they are always "just there", ready to be updated or used. None of this moving them to some fictitious "register" location before anything can be done with them.
Now in a compiler the code generator/optimizer knows what variables it has in what registers at every step of the way. It knows that it just loaded variable X into R1 and variable Y into R2 so if it needs X or Y again it need not load them again, just use R1 and R2. Potentially a small calculation in a loop could work entirely from registers and never access RAM. Which in fact they do when they can.
Anyway, you got me thinking about all that and so I took a look at how prop-gcc is using registers in that fft code.
My conclusion is that there is something broken in prop-gcc. Some registers get used a lot, others hardly at all. Especially during the actual butterflies() function, which is where we do all the work and which is loaded into COG by fcache for fast execution, we find that 7 of the registers are hardly used and 5 are not used at all!
Seems to me that if r0, r1, r2, r4, r8 were used, and the load spread around the others more evenly, the code could be reduced to half its size! And its speed significantly boosted.
Disclaimer: I know nothing of code generators and optimizers so there may well be very good reasons for this anomaly.
Increasing the number of registers, say to 32, would enable keeping pretty much all variables in COG.
Here are my register use counts. Any prop-gcc gurus know what's going on with this?
This is one of the main reasons I don't like PropGCC - you get a processor with 512 registers, and then pretend it only has 16
As opposed to what other compiler for the Propeller? Which one do you use that does not have this problem?
Spin for example does not have the concept of registers, in COG or otherwise, at all. It's all stack based.
Is GCC ... not smart enough to understand that a processor can have a variable number of registers...
That's an interesting question.
I suspect that in general it is not possible for any compiler in any language.
Consider that we had a variable number of registers as you suggest:
Then at the time the compiler is generating code for a particular function/method it has no idea how many registers it can use for that function.
You could say that's OK: if the compiler knew about the whole program as it compiled, it could work out what free COG space there is for the registers it wants.
Well, that is not possible. That function could be called from many places in my program. Sometimes it will be called when there is a lot of free space in COG. Sometimes it will be called when there is hardly any. Should the compiler now generate many different versions of the same function that use different numbers of registers? And then figure out which version to call when?
It gets worse... the amount of free COG space for use as registers may not actually be known at compile time. It may depend on what is going on at run time. The compiler cannot know this.
And it gets worse... What if my function is compiled into a library? In that case the compiler has no idea what kind of program my function will end up in, let alone what is available at run time!
You could get around all this by having registers dynamically allocated, like some kind of "heap". But then you are into a severe run-time overhead. But hey, that is what a compiler does anyway: if it wants to use a register it ensures the old content is saved on the stack or back in main memory. That just brings us back to needing more registers. But I don't think a variable number of registers is possible.
...fast, small PASM programs do a good...An optimizer would need to look ahead several steps to come close to what we know can be done...skilled people do it better.
You know... I like tweaking, polishing and honing a piece of assembler, especially PASM, to its ultimate minimalist or high-speed perfection as much as the next nerdy geek, but I have come to the conclusion that in general your statement is wrong.
Consider the case of the Intel x86. The implementation of x86 has changed a lot over the years. Turns out it's pretty much impossible to write optimal assembler for it. Why? Well, simply because what is optimal one year may not be next year.
I first became aware of this over a decade ago. A common speed-up technique is to unroll loops: write out the code dozens or hundreds of times rather than wrap a loop around a couple of lines. We used to do that for loops in crypto algorithms to increase the speed by a factor of 8 or 10. I was taken aback one day to find that unrolling a loop actually slowed it down! Why? Because modern processors have caches. A small loop in cache runs fast; a big code sequence overflows the cache and gets a 10 or 100 times speed penalty.
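Something like this, as a sketch (not the actual crypto code):
/* Rolled: a couple of instructions, sits happily in the instruction cache. */
int sum_rolled(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 8: fewer loop branches, but 8x the code. Push this far enough
   and it overflows the cache, and the "optimization" becomes a net loss. */
int sum_unrolled(const int *a, int n)
{
    int s = 0, i = 0;
    for (; i + 8 <= n; i += 8)
        s += a[i] + a[i+1] + a[i+2] + a[i+3]
           + a[i+4] + a[i+5] + a[i+6] + a[i+7];
    for (; i < n; i++)   /* leftover elements */
        s += a[i];
    return s;
}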
When you start throwing in modern processor techniques, long pipelines, out of order execution, parallel execution etc., it becomes impossible to predict how long any code sequence you write will take to run!
All in all it's just better to let the compiler deal with it.
On a different tack, humans are good at optimizing small corners of code, like PASM on a Prop. They are not good at optimizing big programs. For example, when you realize that inlining that little function is a worthwhile speed gain, a compiler can do that in no time. The human is going to be there for a month hacking all the places the function is called and introducing bugs as he goes.
As such, I'm on the side of the RISC V philosophy. Get rid of all that junk that is there to help assembler programmers and never used by compilers.
Intel is a mess, ranging from a couple bytes up to nine for some of the longer instructions I saw.
Oh yeah. Here are a couple of interesting factoids:
The longest possible x86 instruction is 15 bytes!
On average Intel has added one instruction to the x86 instruction set every month for 30 years!
You would think that for the sake of code size, and hence speed, you would encode the most often used instructions in the smallest way. You can't have many single-byte instructions, so you had better choose them wisely. So why on earth is AAA (ASCII Adjust After Addition), an instruction that has not been used for three decades and then only rarely, a single-byte opcode in x86?!
Similarly for AAD, AAM, AAS, DAA, DAS. Crazy.
but I have come to the conclusion that in general your statement is wrong.
In most cases, I would agree with you, but the demoscene would beg to differ.
And I'm with them for a lot of reasons too. Your point about changes is a valid one. However, for a given CPU and task, I think people remain better. This is a lot like the chess problem. People may not eventually be better. And it's all increasingly academic, but for some niches here and there.
So we agree! Weird way to get there though.
Intel is such an interesting case!
That single byte DAA may well be translated into a lot of microcode instructions. As such, it may take a byte going in, but require a lot of ops to perform. A 15 byte monster gets shoveled in, but its translation is more direct, bringing the two to some sort of overall parity in terms of execute latency. Bizarre! And definitely not people friendly. However, that means the number of bytes isn't some direct link to performance, due to how the translations happen and what hardware resources are focused on.
And look! That x86 mess came in rather close on code size despite some monster instructions in there. Intel isn't wrong, just somewhat brutal about it all. And there isn't much currently that can do the task faster either. Where there is, it's niche hardware not capable or practical enough to use in a general way.
What I really like about the RISC idea is as we get smarter about compilers, we will get more out of those instructions. Same goes for caching, out of order, pipes, etc... And, if it plays out like MIPS did, we get very good compute / watt numbers that deliver nice and fast per clock. Like I've written before, do that and polish Linux up some, and it would be a very sweet environment.
We get most of this the Intel way, but it takes a very serious effort on Intel's part to actualize it. IMHO, things could be a lot more lean than they are in x86 land for sure. Look at how significant their efforts are to improve compute / watt. They will get it too. Close, like they always do, and binary compatibility will still mostly be there.
Take ARM. Intel is currently tuning in on much improved compute / watt numbers. I suspect they won't best ARM, like they never did MIPS, but they will say it's faster, point to shaving off a lot of power in add on hardware, etc... and say a solution can be better, even if the CPU isn't. They will win a lot doing that. And they've shown it before. If required, they will put a whole building full of people on it too. To them, x86 is the universal instruction set. It is, until somebody else makes it not.
Intel will say the cost of improving beyond that is loss of binary compatibility. Whatever we may think of that, they along with Microsoft really hit it out of the park by respecting the idea of keeping most things running no matter what. And Intel, as usual, is right! Enter RISC V. It's really gonna take somebody working very hard on the hardware implementation to get traction!
In open code land, it's all different. Build it on your new, spiffy CPU and go!
This rift will never close. But RISC V may well bring us an improved set of alternatives in open land. You know the data center guys are watching this. Back in the MIPS days, I would drop one IRIX box into a server room and have it do a ton of work and make almost no noise at all. They noticed. Today, it's scale and watts and throughput. They will notice again.
Edit: So here's a realization!
Intel is actually running a RISC type environment at the core. Then they build this massive translator, fetcher, thing which maps the x86 mess onto it optimally, and in hardware. Neat! And that's exactly why they always come close. They can tune for power, speed, whatever, and do so on the instruction set they own, and that runs a seriously large pile of software.
Maybe it's easier to think of RISC V as stripping all that away, moving the massive translator fetcher thing into software. So then, some sort of x86 to RISC translator can be made. ---which is precisely what DEC did with the Alpha. I ran that thing, and it was interesting. First run, bog slow. Second run faster, and program translation data got stored. Third run, near native Alpha performance when running an x86 program.
IMHO, RISC V has an opening here. There won't be that extra compatibility layer, and there is a lot of stuff running on open code that can take advantage with a recompile. It's all about stepping away from that costly x86 emulation layer, and that massive effort required to beat it into performing. That can be moved to software. The likes of Google, Facebook, et al. will take something like that and optimize the heck out of it for what they do, and they will if the door actually gets opened to give it a go.
Modern demo writers make their code in JavaScript though, for portability. No, seriously: http://js1k.com/2010-first/
Javascript is a demoscene playground. It's just the sort of loose, highly exploitable language that will attract scene types.
As for modern, that really varies. Most of them will excel in a niche, then try others to build their skill or improve the challenge. Teams form too, old and new school to make some interesting productions.
Last year, at least one party featured a tech talk on the Atari VCS (2600) for new and upcoming sceners. Modern demo writers center in on things they find really interesting, old, new, a blend, etc...
And then there is "wild", where you run what you brought. People are building little boards now, partially due to Linus. And there is this! A demoscene-specific platform targeted at growing the scene in the US: http://hackaday.com/2015/03/05/revive-the-demoscene-with-a-layerone-demoscene-board/
I'm so tempted to get one and play, but I can't afford the time right now.
Really, I had hoped to see the P1 sprout a little scene, which it did for a while. Oh well, maybe this little board will bring a new group into the fold.
And I'm guilty, because we are the scene. If enough people can't afford the time, no scene. Gotta get back to that.
My conclusion is that there is something broken in prop-gcc. Some registers get used a lot, others hardly at all. Especially during the actual butterflies() function, which is where we do all the work and which is loaded into COG by fcache for fast execution, we find that 7 of the registers are hardly used and 5 are not used at all!
Seems to me that if r0, r1, r2, r4, r8 were used, and the load spread around the others more evenly, the code could be reduced to half its size! And its speed significantly boosted.
I'm sure a reason they chose 16 registers in Prop-GCC is to make immediate use of work already done.
i.e. a [two-operand opcode, 16-register core] will already have had a code generator/optimizer, and that leaves just some rules and opcode mapping.
Increasing the register count alone is of little use, unless a new optimizer is created.
Code generators also follow rules about what they try to save/avoid using, to limit push/pops, however that should affect only entry/exit sizes, not the inner loops.
Hey, on the topic of Javascript stuff: http://www.windows93.net/? Just tossing it out there for entertainment. Enjoy your Saturday!
No doubt a lot of work has been reworked for prop-gcc. But what about all the other 32-register CPUs GCC supports? I still don't see the need for only using 16 regs.
@potatohead,
Windows93 crazy stuff. And I thought running Linux in JavaScript was a weird enough idea: http://riscv.org/angel/ or http://bellard.org/jslinux/
I already enjoyed Friday too much. Absolutely cannot do any enjoying today.
This RISC talk is interesting, but it seems to neglect the reasoning behind the tradeoffs that dictate the cog architecture being the way it is. That being memory and memory access, which is a particular concern at the relatively slow and non-dense process nodes of the P1, and what the P2 will be using. I mentioned before that I have little doubt in Chip's ability to design Harvard architecture cores, but how well can that work for multi-core uCs?
All the big boys in the high-performance multi-core CPU arena have to come up with all sorts of complex schemes to keep all those cores busy. Not only does that seem like an impossible undertaking for a small silicon design house like Parallax, but it also goes not only against the prop ethos, but against uC simplicity in general.
It's certainly hard enough deciding how much general purpose cog memory is enough, but imagine how much more difficult it would be to break that down into dedicated code and variable ram space a Harvard architecture would require. You could say "forget about cog ram altogether" but then you'll further slow things down due to hub memory bandwidth limitations. Not only that, but you throw determinism out the window. Oh, and you would reduce the total amount of memory further by now needing to make all of hub ram dual ported at the very least. Is that really any better?
The problem is that there's no solution that is ideal for all use cases, but I still can't help but think Chip chose wisely, all things considered. Hub execute and the new hub ram partitioning might go a long way towards increasing the performance of the Propeller design. I think the P2 will be able to pull off some functionality better than even multi-core 1GHz+ ARM chips can. Doesn't that say a lot? Give users something like a high-speed QSPI interface to communicate with these fancy ARM chips, and I think a lot of people will get blown away. The P2 doesn't have to do everything, it just needs to do some things really well, and I believe it just might.
No doubt a lot of work has been reworked for prop-gcc. But what about all the other 32-register CPUs GCC supports? I still don't see the need for only using 16 regs.
Good question. Wikipedia says this:
"Due to the large number of bits needed to encode the three registers of a 3-operand instruction, RISC processors using 16-bit instructions are invariably 2-operand machines, such as the Atmel AVR, the TI MSP430, and some versions of the ARM Thumb.
RISC processors using 32-bit instructions are usually 3-operand machines, such as processors implementing the Power Architecture, the SPARC architecture, the MIPS architecture, the ARM architecture, and the AVR32 architecture."
The only 32-bit core in the 2-operand list is "some versions of the ARM Thumb", and that is limited to reaching only 16 registers - so it does look like all the preceding work was two-operand, 16-register.
I do not see any mainstream, 32 register, 2 operand cores ?
I agree with you, and it's just fun to muse about these things. The instruction set dynamics Heater put here were a lot of fun to think about. Like any muse, it's not really serious.
Frankly, I'm pretty stoked about the last design iteration. The very seriously improved HUB throughput is going to deliver amazing things.
The MIPS R10K was done at 350 nanometers, FYI. 350-450 MHz was FAST relative to Intel at twice the clock and a smaller process. I think the process for the P2 is more than up to the expectations we have for it. It just requires some attention to heat.
I do not see any mainstream, 32 register, 2 operand cores ?
Good point: more registers means more operand bits, which means they won't fit in 16-bit instructions. Same as the Prop: COG space is determined by the space available for its operand fields.
That Spansion Hyperbus sounds interesting.
Ah, the Intel i860. What a disaster. When that launched a company I worked for wanted to know if it was suitable for what they wanted to do. So I went along to the Intel i860 launch shindig in London. One of the presentations was an in depth look at programming the thing in assembler. It was amazing. Let me see if I can recall how it went. Something like this:
1) The example the guy was working with was, guess what? The inner butterfly loop of an FFT!
2) So first he showed how to write the butterfly in a naive, regular-looking way. OK, it had three-operand instructions, which I was not used to, but that looked easy enough. The performance of this was something like 5 or 10 MIPS, nowhere near the theoretical peak floating point performance of 60 MIPS or so.
3) Next up he showed how to introduce use of the Multiply And Accumulate (MAD) instruction effectively getting two FLOPS for the price of one. The performance went up a bit. Cool.
4) Next up he showed something really off the wall. Parallel execution. You could write two instructions on each line! One integer and one floating point. The idea being that the integer ops are often your loop control instructions and the floating point ops are your real work. Might as well do them both at the same time. Performance went up a bit more.
Cool but starting to look very complicated to write.
5) Somewhere during the proceedings he introduced the i860's idea of pipelining. Crucial to get decent performance out of this chip. Problem was, the programmer had to take care of it. With pipelining enabled, the result of the instruction you write comes out five instructions later. That result gets written to the destination operand of the fifth instruction down from it. Like so:
PFADD.SS F4, F2, F3 ; The result does NOT arrive in F4
PFSUB.DD F10, F8, F6
PFMUL.DD F16, F12, F14
PFADD.SS F19, F17, F18
PFADD.SS F22, F20, F21 ; The result of that first instruction arrives in F22 !
Now imagine you have similar going on with a column of floating point instructions to the right of all that.
At this point my brain exploded and I realized I was far too stupid to ever program this beast usefully.
6) Here is the kicker: on the last slide of the presentation, with all the above tricks in place, that code was still short of the target 60 MIPS. The guy explained that 60 MIPS had been achieved, but he could not show us how because it was done by some company outside Intel. That is to say, Intel did not know how to program their own chip!
Needless to say the company I worked for did not adopt the i860.
After all these years I find, and thank you for the link, that "hand-coded assemblers managed to get only about up to 40 MFLOPS, and most compilers had difficulty getting even 10 MFLOPs."
I'm relieved, seems it was not just me who was too stupid to program the i860 after all.
I wish I had kept the slides from that presentation.
Edit: Changed my example code to something more realistic, taken from a page by a guy who actually liked writing that stuff: http://blog.garritys.org/2010/04/intel-i860.html
I'd agree, but as well as QSPI I would now add the Spansion Hyperbus (essentially two QSPI (DDR) + housekeeping), as that gives a cheap, low pin count means to add on large amounts of Flash/RAM (64Mb RAM just announced) to a pin-limited microcontroller.
Memory/ Memory handling is too often the Achilles heel of hot-new-MCU designs.
I know you're the vocal proponent of the Hyperbus on this forum, and I can't argue with you on its technical/performance merits. All I'm parroting is what others here have suggested: that a high-speed interface to an application processor may have significant merit (so that probably means QSPI).
I agree with you, and it's just fun to muse about these things. The instruction set dynamics Heater put here were a lot of fun to think about. Like any muse, it's not really serious.
Frankly, I'm pretty stoked about the last design iteration. The very seriously improved HUB throughput is going to deliver amazing things.
The MIPS R10K was done at 350 nanometers, FYI. 350-450 MHz was FAST relative to Intel at twice the clock and a smaller process. I think the process for the P2 is more than up to the expectations we have for it. It just requires some attention to heat.
I understand the interest in discussing these things, and I sincerely appreciate them as I've learned quite a bit that way. All I was trying to do was inject some points to the discussion which I believe were being neglected as they possibly relate to a multi-core setup at the particular process node the P2 will be using. In many ways, it's easy to see how the P2 will punch below its weight while neglecting the ways it will punch above it.
I won't argue there wasn't significant performance achieved from relatively large process nodes in the past that would put a 180nm P2 to shame in terms of single-thread throughput, but I will argue that it's really an apples/oranges comparison, as those were all single core. For instance, look at jmg's i860 link above: 16/16k code/data cache. Fitting multiple cores plus efficiently feeding them with work/data would have been impossible (unless they made the die sizes MASSIVE). We would be furious with such limitations nowadays.
@Heater
I attended a similar shindig here in Oz. My recollection of the event was the sales rep from Intel was cute and the sandwiches were nice.
Intel's efforts were not completely wasted. Here is my i860 databook being put to good use to lift my monitor to a comfortable height.
Cool. I only have some old 8086 era Intel data books some place.
I don't recall any cute sales reps from the London gig.
I do remember in one presentation they said the i860 was the first chip with one million transistors. That seems quite cute now.
I also remember that many exhibitors there were using INMOS Transputers with i860's to take care of the communication within a computer matrix. Sadly that did not go anywhere either.
The world has changed, now we're approaching sub-nano processes.
I can't help but ponder about that pesky AAA instruction, and friends, in x86. Probably because it gave me such a headache to get right in the ZiCog emulator. I notice many other 8080/Z80 emulators have had bugs reported against their versions of DAA.
As far as I understand, Intel's first processors were 4 bits wide, the 4004/4040. They were targeted at calculators, where decimal arithmetic was key. I even saw one once, on a board in a pile of old prototypes at a company I worked at.
When they moved to 8 bits, the 8080/8085, Intel had an eye to backward compatibility. Those algorithms created for the four-bitters had to be movable to the 8-bitters. That called for a Half Carry bit and the DAA instruction. Besides, it made it easy for regular programmers to do BCD arithmetic in assembler, saving valuable code space as a bonus.
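For anyone who never met it, DAA's job is easy enough to show in C. This is a simplified sketch of the 8080-style add case only; the subtract and flag corner cases are exactly where those emulator bugs I mentioned like to hide:
#include <stdint.h>

/* Decimal adjust the accumulator after adding two packed BCD bytes.
   half_carry = carry out of bit 3 from the ADD, carry = carry out of bit 7. */
uint8_t daa_after_add(uint8_t a, int half_carry, int *carry)
{
    if ((a & 0x0F) > 9 || half_carry)
        a += 0x06;                 /* fix the low BCD digit  */
    if ((a & 0xF0) > 0x90 || *carry) {
        a += 0x60;                 /* fix the high BCD digit */
        *carry = 1;
    }
    return a;
}

/* Example: 0x19 + 0x28 = 0x41 in binary with half carry set;
   DAA turns that into 0x47, i.e. 19 + 28 = 47 in decimal. */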
When they moved to 16 bits, the 8086, Intel again made it backwards compatible. Not at the binary instruction level, but they provided a simple translator, conv86, that would take 8-bit assembly language and translate it into 8086 assembler which, when built, would run on your new 16-bit machine. On one job I hacked up a module carrying an 8088 that plugged into an 8085 socket on their processor cards and ran such converted code. That demo was the beginning of that company's first 16-bit board design. It was essential their old 8-bit code came along for the ride. This required that Intel bring DAA along for the ride as well, even if it was hardly ever used by that time.
And so DAA was hanging in there like some useless appendix as the x86 evolved. Until it was finally killed off in the 64-bit mode of the AMD64 design.
It's amazing to think that such compatibility goes all the way back to the 4004 of 1971 !
Seems, Parallax could have done themselves a favour by taking this lesson from Intel and ensuring that the Propeller ran BASIC Stamp code out of the box.
I do hope the P2 runs most P1 code without needing too much tweaking.