Wow, that looks like they had to tiptoe around some patent.
Saved: one carry flag, a single Boolean register.
Cost: one whole 32-bit register consumed to build a carry on the fly, plus five repeated references to that register in the code.
They have not really 'eliminated' the carry flag, just created one on the fly instead.
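The flag-free trick is easy to see in C. Here is a minimal sketch of my own (an illustration, not taken from any compiler output) of a 64-bit add built from 32-bit halves, where the carry is recomputed into an ordinary register, which is exactly what RISC-V's sltu instruction is doing in that code:

#include <stdint.h>

/* 64-bit a + b on a 32-bit machine with no carry flag.
   The "carry" is just the result of an unsigned compare,
   held in a normal register. */
void add64(uint32_t a_hi, uint32_t a_lo,
           uint32_t b_hi, uint32_t b_lo,
           uint32_t *r_hi, uint32_t *r_lo)
{
    uint32_t lo = a_lo + b_lo;      /* may wrap around */
    uint32_t carry = lo < a_lo;     /* 1 if it wrapped (the sltu) */

    *r_lo = lo;
    *r_hi = a_hi + b_hi + carry;
}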
I don't think that's the reason (patent). The RISC-V guys claim (and I believe that's why MIPS did it that way too) that eliminating the status flags removes a lot of logic and simplifies some other issues, so it improves speed in the general case. And that was what RISC was all about when the term was invented.
Wikipedia says this about the missing status flags register:
"Some RISC instruction set architectures dispose of centrally located arithmetic result status bits. The reasons cited are interlocking between commands setting/accessing status flag, which either leads to performance degradation or need of extra hardware to work around problems in pipelined, superscalar, and speculative processors. MIPS, DEC Alpha, and AMD 29000 are examples of such architectures."
Yes, and in the extreme case the -fcache option in prop-gcc stuffs chunks of code into COG RAM and runs them at full speed, like the inner loop of my FFT, which ends up running nearly as fast as the hand-crafted PASM version.
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

/* volatile so the loads, the add, and the store survive optimization */
volatile long x = 1;
volatile long y = 2;
volatile long z = 2;

int main(int argc, char* argv[])
{
    z = x + y;
    return 0;
}
I used objdump -d to get the binary disassembly.
Whilst we are here, here is how 32 bit x86 and RISC-V cope with 64 bit addition:
RISC-V 32 bit doing 64 bit addition: 44 bytes.
Intel 32 bit doing 64 bit addition: 42 bytes.
A marginal win for x86 on code size there.
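(For anyone wanting to reproduce those numbers: I assume the source is just the earlier test with wider operands, so something like the following, though this is my reconstruction rather than the exact file used:)

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

/* 64-bit operands force the compiler to synthesize
   a carry on a 32-bit target */
volatile long long x = 1;
volatile long long y = 2;
volatile long long z = 2;

int main(int argc, char* argv[])
{
    z = x + y;
    return 0;
}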
So they don't get optimized away and you are just looking at raw code generation and not optimizer efficiency at this point?
Do I get a prize????
Maybe, but looking at the code that GCC generates with the optimizer turned off is pretty depressing. GCC depends a lot on the optimizer to generate efficient code.
As far as I understand, "volatile" only disables optimizations that would cause the variable not to be read and written at the points in the code where you expect it to be.
I would imagine that the actual loading and storing and adding is done the same way regardless of the volatile.
Anyway, it turns out that with the variables as globals the volatile makes no difference: the code prop-gcc generates is exactly the same at -O0 and -O3.
Move them into locals and the whole thing is optimized away as there is no input or output!
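A self-contained way to see that, sketched by me rather than taken from the thread:

/* Plain locals: there is no observable input or output,
   so the optimizer deletes the whole computation. */
long dead(void)
{
    long x = 1, y = 2, z;
    z = x + y;                 /* folded away at -O3 */
    return 0;
}

/* volatile locals: the loads of x and y, the add, and
   the store to z must all survive, even at -O3. */
long alive(void)
{
    volatile long x = 1, y = 2, z;
    z = x + y;
    return 0;
}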
That's exactly the same as what Heater posted except that he expanded the mvi macros.
OK. The mvi macro does make the disassembled code look ugly. However, the optimizer should keep the addresses of x and y in registers if they were referenced later in the code. So the extra instructions to load the addresses wouldn't be so bad for code that used x and y more than once.
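For instance (my own sketch, not actual compiler output), once the globals are touched more than once, the optimizer can load their addresses into registers one time and reuse them, so the mvi overhead is paid only on first use:

volatile long x, y, z, w;

void twice(void)
{
    z = x + y;    /* addresses of x, y, z set up here */
    w = x - y;    /* addresses of x and y can be reused; only the
                     volatile loads and stores are repeated */
}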
It would have been nice if the P1 had a way to move a 16-bit immediate value into a register in a single 32-bit instruction.
What's depressing is that if I remove the volatile I get the same code at -Os and -O3 as at -O0. That addition takes 10 longs.
Yeah, LMM isn't that efficient. Maybe the RISC V instruction set would be worth considering. I think code density is likely to still be a problem with hub exec on the P2 with the current COG instruction set.
I guess something like RISC-V would be great if it executed directly from HUB. But then we suffer the speed penalty of 16 processors all trying to share HUB RAM access. Not good.
The current Prop ISA at least allows code to run at full native speed within the confines of a COG.
A curiosity would be a RISC-V ISA implementation that could also read code from COG space, keep its 32 registers in COG space, and access data in COG space, as well as being able to execute code and access data from HUB. These spaces would just be distinguished by their address ranges.
Or, more simply, just have 16 RISC-V machines sharing the bulk of RAM but each with its own dedicated and fast couple of KB of space.
I think the base RISC-V spec already includes facilities for guarding access to shared RAM in multi-core systems. So we are good to go :)
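For what it's worth, from C those facilities are reachable through the standard atomics. A minimal sketch, assuming a C11 toolchain and, on RISC-V, the 'A' atomics extension underneath:

#include <stdatomic.h>

atomic_int hits;    /* shared between cores */

void count_hit(void)
{
    /* With the RISC-V 'A' extension this can compile to a
       single amoadd.w: no lock, and no flags involved. */
    atomic_fetch_add(&hits, 1);
}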
Ah, such fantasies...
Cache memory solves the problem of sharing RAM. Of course, that introduces cache coherency issues, copy-backs, cache lines, and all the lovely things associated with cache memory.
Worst of all, cache memory buggers up any hope of deterministic execution speed.
Except... I guess if code was cached, say a few KB of it, then that code cache would replace the need for COG code space. Once the cache is loaded you just run from it at full speed. As code is read-only there are no issues of cache coherency and so on.
Data access could remain uncached, with round-robin access as it is now, or whatever the P2 is now doing. Except for some "magic" that would basically say "this data area can be loaded to cache and after that it never goes back to main RAM": bingo, you have full-speed COG-style data access as well, again with no coherency issues.
I think we have a plan here!
Chip, Chip, we have re-examined our P2 requirements. Can you make us a RISC V Propeller?
I'm very happy to hear you are moving ahead with ChromeBook support.
Is that just for Spin or does it include C?
If it is just Spin, how are you doing it? Are you using Emscripten to get OpenSpin running in Chrome as JavaScript and using the Chrome serial port API?
That would be great because then it will all work as a Chrome extension on the PC and Mac and Linux as well.
Love to know more about all this.
I'm still unconvinced that dropping the flags has engineering merit, but I can see the marketing types getting excited about being able to quote (slightly) higher MHz values (even if an add then needs more opcodes).
As for the speculative execution claims, the made-on-the-fly carry register still needs to be protected, so I'm not seeing concrete reasons there.
Even the "dispose of centrally located arithmetic result status bits" is semantics, as they instead CREATE a carry bit in a register when they need it. That register is still centrally located, but the name has gone.
The big gains are in the compiler optimization and large, simple, fast caches.
hehe, sure, if you ignore the memory reach differences and include all the loads and saves.
The code that actually adds registers is what you should compare.
Comparing 12b reach with 32b address reach will always give a 'win' to 12b.
There is a clear cost to creating a register carry; the claims are marketing-level vague about what 'gains' are made elsewhere to compensate, but for an embedded minion processor, code size always matters.
Less so on a main CPU with XX MB of DRAM.
And here, nope: the Intel opcode that adds is 24b vs the RISC-V 32b, so you can see why RISC-V is adding a new extension of 16b opcodes.
I wonder if there are any numbers yet on the added LUT cost of that?
Is the new 16b opcode extension viable for a P2-level / minion processor?
And RISC is about the CPU as much as it is anything.
I am no CPU/instruction-set designer, so I cannot competently talk about "engineering merit" vs "marketing hype" when it comes to the details of carry flags and so on.
From a CPU designer's perspective, I'm pretty sure that carry and other status flags are a complication to implement.
The proof is in the pudding, of course, and as far as I can tell things like ARM and Alpha and perhaps MIPS have outperformed x86 MHz for MHz.
What I am sure of is that instruction sets in the old days were designed with features to make life easy for assembly language programmers and a total disregard for what compilers can do.
The canonical example of this is the fact that the Intel 8086 and everything upwards from it included a single-byte AAA instruction: an instruction that is only useful to people working with Binary Coded Decimal, is never used in the modern world, and consumes a valuable one-byte opcode!
I might guess that all these carry and overflow and whatever flags are also a hangover from the days of making life easy for the assembler programmer. As I said, who needs a carry in a 64-bit CPU?
This is borne out by the fact that even Intel/AMD dropped the AAA instruction in the 64-bit instruction set.
BUT, all this detail is ultimately irrelevant; the instruction set does not count for squat in the scheme of things: algorithms do, compilers do, individual coders do, chip technology does, memory speed does, caches do.
That is the thesis of RISC-V. Given that the instruction set is a small detail in the scheme of things, let's agree on one and use that. Then we don't have to depend on an Intel or an ARM any more. Chip fabs and SoC vendors like this idea.
One detail of RISC-V that appeals to me is the 12b opcode address field, which gives a natural RISC-V 'COG-size' type local memory boundary of 8k bytes - not a lot different from a Prop.
IIRC Chip was looking at making P2 jumps relative, to allow the same location independence.
From a CPU designer's perspective, I'm pretty sure that carry and other status flags are a complication to implement.
There, you may have a point - I suspect their Chisel design tool handles the broad-stroke stuff well, and is productive there, but does not do the nitty-gritty details like carry-flag management so well.
...sure, if you ignore the memory reach differences and include all the loads and saves.
The code that actually adds registers is what you should compare.
I'm not sure I understand what you mean.
The instruction that adds registers is only a small part of the story. For sure I have to fetch my operands from somewhere and save the result back again; otherwise I have not done the job of z = x + y, for example.
I could have an infinitely fast ADD but take all day to fetch and write data. How useless would that be?
Reality is that most of the time of most processors is spent just moving data around, not actually doing the operations you want.
My impression is that the Chisel design tool is quite capable of specifying any logic you want, bit by bit, flip-flop by flip-flop, same as Verilog or VHDL. I guess we have to try it out to be sure.
Your Intel code fetches/stores from absolute 32b space, whilst the RISC-V code fetches from a much smaller 12b space. Ignoring that significant difference skews comparisons. Apples and oranges problem.
But it's a stripped-down, simpler implementation -- 32 bit, not pipelined, etc.
So what you are saying is that RISC-V is better for systems with small memory spaces. Sounds good to me.
I guess my problem now is to figure out how to craft a test that puts the data further away.
Yup, their opcode design should gain on close-in data, and lose on large arrays, 32b space, and any data that is over the 12b reach.
Code that sums a small array and then sums a large array should show that?
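Something along these lines might do it; the array sizes are my guess at what sits inside and outside the 12b reach:

volatile int near_data[16];     /* should be reachable within the
                                   12b offset of one base register */
volatile int far_data[8192];    /* 32 KB: reaching the far end needs
                                   extra address-forming instructions */

long sum(volatile int *p, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* compare the code generated for sum(near_data, 16)
   against sum(far_data, 8192) */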
I would expect their 16 bit extension to drop common opcodes like add to 16b.
That 12b relocatable reach detail is a nice feature.
Since you are making a list, I would add another item that 'comes from below' and, while it does not displace a P1 or P2 (it can work very well with a P1 or P2), it does change the dynamics of choice for a new user.
7) Silabs EFM8BB1 - from $0.21 @ 10k, for 2K of flash and serious peripherals: 12b ADC / PWM / SPI / I2C / UID.
Back to why I am here: as it is apparent that a P2 won't be around this year, maybe not even the next, what should Parallax do in the meantime? As has been brought up, the P1 makes a good hobbyist tool, mostly because it is almost a single-chip solution and comes in a DIP form factor.
Likewise, it's fairly easy to comprehend and well-documented, and comes with quality free tools. DIP is an added bonus.
... and throw in a bit more HUB RAM, up the clock to 100MHz or so, and make it 5V tolerant. Just a small incremental improvement on the Prop 1, and it would sell well to 'maker' types,...
It would also likely make its way into more product designs. Often it's little things like those, especially in combination, which make or break the deal for an embedded engineer weighing all the trade-offs.
Ken,
The only issue I have is that the Pi seems to be becoming a disruptive technology.
The Pi gives them both a computer with a GUI and decent uC access to GPIO, etc.
The VAR chops Parallax has probably put the Pi to shame currently. However, with such massive market penetration and the lesson sharing that occurs in the educational system, I think they will quite likely vastly improve their actual educational material.
If that market is really profitable, and no doubt it is, I think it would be not very difficult, and quite lucrative, for someone (and that could be anyone) to put together a complete education package based on the Rasp Pi (or BeagleBone, or Arduino, or Netduino, or...). In fact, I wouldn't be a bit surprised if someone or some company is already doing that.