OK, in an attempt to make the "reach" bigger how about this code:
volatile long junk1[100000000];
volatile long x = 1;
volatile long y = 2;
volatile long z = 2;
volatile long junk2[100000000];

void useJunk()
{
    junk1[0] = 0;
    junk2[0] = 0;
}

int main(int argc, char* argv[])
{
    z = x + y;
    useJunk();
}
That puts x, y and z in the middle of a huge data space.
Here is what we get for 64 bit compiles: both RISC V and x86 come out at 24 bytes worth of code each.
I'm not sure what to make of this but it seems it's neck and neck.
If I was a chip or SoC vendor I could be confident to move ahead with RISC V and not suffer embarrassing performance hits. Which is what I suspect might happen.
I'm no RISC-V ASM expert, but that code looks incomplete.
It seems to still use a 12b relative address from a preloaded (gp), and I'm not sure quite what the (sp) is doing, but that likely has a preload too. i.e. you need to include all the code that loads (gp) and (sp).
Intel uses 7 byte opcodes, but given their wide-fetch paths these days, that may have little speed cost.
Here is the entirety of main(). As you see, all I missed out was the call/jump to useJunk().
It's a mystery to me how that all works as well.
Of course RISC V and anyone else can have wide fetch paths if they want. But you have to admit, minimizing the amount of redundant stuff going to and from memory can only be good.
The preamble stuff must load (gp), but you can see the 12b address at work here: if those x, y, z were not adjacent, or not within the 12b reach, the opcodes would grow.
Strange, the RISC-V spec pdf I have has no mention of (sp) or (gp) (another wry smile about 'standards'....)
I'm very happy to hear you are moving ahead with ChromeBook support.
Is that just for Spin or does it include C?
If it is just Spin, how are you doing it? Are you using Emscripten to get OpenSpin running in Chrome as JavaScript and using the Chrome serial port API?
That would be great because then it will all work as a Chrome extension on the PC and Mac and Linux as well.
Love to know more about all this.
Hey Heater,
PBASIC, Spin and eventually C. I think we're using the Chrome serial port FTDI driver. As you know, I don't know a lot about this project except that we've got the best developers on it (from Iced Dev), that Jeff Martin is managing it, we're seriously invested in the project, and that we'll be sharing the project each step of the way. So, while this is a non-answer of sorts, count on real information coming soon from people who know what they're talking about.
Ken Gracey
Who knows? To me they sound like (general purpose pointer) and (stack pointer), but the pdf shows x0..x31 as registers, and x1/ra has the return address, so there seem to be stack limitations here too.
The (sp) above has stack-like code, but I'm not sure why - I think the (sp) code inside the add is simply plonked there, and does not seem to be add-related.
Maybe the C code is busy preserving the return from the call to main, by building a stack in SW. Ouch.
It certainly looks like stack frame maintenance code to me. Maybe it's because of the nested call to useJunk whatever that is.
Data mentions a register return address x1/ra
If we strip the add code, it seems to be saving ra @ sp-16 (+8?), (as jal will need ra), then restores ra and (sp). The 8 bit micro I use, which has a stack, can do a near-call in two bytes (not 20?!).
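To illustrate the ra point in C (a hedged sketch with made-up functions, not the listing under discussion): on RISC-V the return address lives in the ra register, so with optimization on only a function that itself makes a call has to spill ra to the stack, because its own jal would overwrite it.

long add2(long a, long b)           /* leaf: makes no call, so with       */
{                                   /* optimization it needs no stack     */
    return a + b;                   /* frame and never touches ra         */
}

long add3(long a, long b, long c)   /* non-leaf: its own calls overwrite  */
{                                   /* ra, so the prologue saves ra on    */
    return add2(add2(a, b), c);     /* the stack and the epilogue         */
}                                   /* restores it before returning       */

That would explain why main() in the test above grows a small stack frame: it calls useJunk(), so ra has to be preserved around that call.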
Those are common register names by convention, if RISC is following the MIPS path, which appears likely.
Stack in software.
This is Reduced Instruction in play after all.
You leave that stuff to the compiler. It can interleave things, find commonalities, and maximize both the instruction pipe (out of order, if it is like MIPS V) and caches.
Even the "dispose of centrally located arithmetic result status bits" is semantics, as they instead CREATE a carry bit when they need it in a register. That register is still centrally located, but yet, the name has gone.
To me it seems to be a major difference. Admittedly, I am no chip design expert. But the status flag register has traditionally been used to set flags based on the current operation (carry, for example, may be set by other operations than just 'add' also). This is something those of us who started with the 6502 and Z80 etc. have been familiar with for a long time.
But when you have CPUs where multiple instructions are executed in parallel, how do you set and keep the values for the status register? It seems to be a much cleaner and simpler approach to not set any such flags (which is the option MIPS, Alpha and RISC V went for). Then, when you need a flag, you use e.g. the SLTU instruction to operate on specific registers and leave the result in a specific register. This doesn't collide with anything else a superscalar might be doing - either the compiler will have to arrange the instruction order so that registers are not used at the same time, or the CPU itself re-orders it so. But if you have a central status register where (traditionally) nearly every operation would set one or more flags, then there's no way out - you will either have to remove the parallelism, or have some kind of shadow status register copies hanging around that can be matched to the right operation. Which is probably what modern superscalar CPUs with status registers do. Sounds very complicated to me.
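As a rough C sketch of "create the carry only when you need it" (my generic example, not something from the thread): the unsigned wrap-around test below is exactly the sort of thing a compiler turns into a single SLTU on RISC-V, with no central flags register involved.

#include <stdint.h>

/* Returns a + b and writes the carry-out to *carry. The add itself sets
   no flags anywhere; the comparison materialises the carry into an
   ordinary register only because we asked for it. */
uint64_t add_with_carry(uint64_t a, uint64_t b, uint64_t *carry)
{
    uint64_t sum = a + b;
    *carry = (sum < a);             /* unsigned wrap-around == carry out  */
    return sum;
}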
The difference can maybe be boiled down to 'setting flags on (nearly) every operation' and 'setting flags only when you need them'.
.... This doesn't collide with anything else a superscalar might be doing - either the compiler will have to arrange the instructions order so that registers are not used at the same time, or the CPU itself re-orders it so.
That rather convoluted argument only really applies where a device might "arrange the instructions order ", ie a device with something like
F Standard Extension for Single-Precision Floating-Point
or the
D Standard Extension for Double-Precision Floating-Point
could do that.
and that is certainly nowhere near a P2 COG replacement, or even any minion processor use.
The more I look at RISC-V, the more compromised it appears, for deeply embedded, small core work. Not something Parallax should invest time in, or even worry about affecting their market.
At risk of boring everybody except compiler nerds I rearranged my test a bit so as to isolate the test case and make sure our variables were spread far and wide. The rearranged test is below.
Intel x86 code is 40% bigger than RISC V! x86: 71 + 32 = 103 bytes total. RISC V: 52 + 20 = 72 bytes total.
Notice how the test() function is inlined in both cases but its code is still included as it's a global function.
I have no idea how you arrive at 20 bytes for that near call. If I turn off optimization that near call is one instruction, 4 bytes. With optimization on it's zero bytes!
Seems the RISC V is quite well suited to low memory embedded devices. What do you mean by "compromised"?
Next up I will try and get results for ARM to compare against.
volatile long junk1[1000000];
volatile long x = 1;
volatile long junk2[1000000];
volatile long y = 2;
volatile long junk3[1000000];
volatile long z = 3;
volatile long junk4[1000000];

void test()
{
    z = x + y;
}

void useJunk()
{
    junk1[0] = 0;
    junk2[0] = 0;
    junk3[0] = 0;
    junk4[0] = 0;
}

int main(int argc, char* argv[])
{
    useJunk();
    test();
    return 0;
}
I have no idea how you arrive at 20 bytes for that near call.
Directly from your own example of #394, that was the Sw-Stack and call overhead, as it moves to preserve ra
That was what I pasted in #400, == 20 bytes.
In your latest listing, that SW stack seems to have vanished, but so too has the jal
I also see no calls, so it is not clear which fruit we are now looking at, but I think neither apples nor oranges.
Zero is also a special case, so gives different RISC-V code size than other values.
In-lining of tiny functions is not a real-world test, but the fact they have chosen to duplicate that much (16 bytes!) gives a hint of the CALL overheads lurking.
A RISC-V can reach (somewhat) more memory in short-opcode mode than a Prop can, but it seems to be very hungry and so is likely to need all that extra memory, to do the same tasks.
Memory = valuable die area at 180nm.
We have to be careful to separate chip capabilities from compiler optimization contortions. So I revisited that old test and set different optimization levels. See below.
Certainly RISC V can make that call with a single instruction, see the "-Os" case.
All that stack messing, the "20 bytes" thing, shows up in the optimize for size case "-Os". Now we can see that is a piece of compiler optimization weirdness. I'm not sure why it is doing that and apparently making bigger code than -O3. -Os is supposed to optimize for size! Anyway, that is leading us astray when thinking about this instruction set.
Edit: Turns out the resulting binary is exactly the same size for -Os and -O3. So what all that -Os weirdness is about is beyond me.
The fact that you see no calls is good. No calls is quicker :) Inlining is generally considered a good thing. If you don't want it, it's easily disabled.
What do you mean by "duplicate...16 bytes"? The function may well be inlined but it still has to be there as a callable function. It is globally available. If I make it local by adding the "static" attribute to it then it is omitted from the binary.
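A stripped-down sketch of that static point (same shape as the test above, with the junk arrays left out):

volatile long x = 1, y = 2, z;

/* static gives test() internal linkage: once the compiler has inlined it
   into main() there is no possible external caller left, so no
   out-of-line copy of test() has to appear in the binary at all. */
static void test(void)
{
    z = x + y;
}

int main(void)
{
    test();                         /* inlined at -O3 / -Os               */
    return 0;
}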
There is no "hint of the CALL overheads lurking" anywhere as far as I can tell. Rather than "hungry" it's looking very lean to me. Especially compared to x86 or prop-gcc LMM mode.
Now, you are right, "Memory = valuable die area at 180nm." RISCV is unlikely to achieve the code density of Propeller COG code, what with the Prop's flags, conditional execution, bit testing and twiddling instructions etc etc. But then the Prop sacrifices all hope of running bigger programs efficiently.
So, next up I might be comparing RISCV with prop-gcc COG mode code generation.
Aside: I tried increasing the junk arrays by another factor of 10 just for fun. Turns out GCC for Intel can't build it, the link fails! See below. RISCV was OK though. (I believe the relocation errors are because the default x86-64 code model assumes all data is reachable with 32-bit PC-relative offsets.)
$ gcc -Wall -O3 -o long long.c
/tmp/ccfQxC7h.o: In function `useJunk':
longlong1.c:(.text+0xe): relocation truncated to fit: R_X86_64_PC32 against symbol `junk2' defined in COMMON section in /tmp/ccfQxC7h.o
/tmp/ccfQxC7h.o: In function `main':
longlong1.c:(.text.startup+0x26): relocation truncated to fit: R_X86_64_PC32 against symbol `junk2' defined in COMMON section in /tmp/ccfQxC7h.o
collect2: error: ld returned 1 exit status
Heater, maybe you can compile something simple that gets out of zero land. It is a special case, r0 always is zero. Other values, and sums would show the extra coupla instructions needed to handle values.
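Something like this would do it, I think (a guessed-at variant, not a listing from the thread): storing zero can come straight from the hard-wired zero register, while any other constant has to be built in a register first, so the two stores below should not cost the same.

volatile long junkA[1000000];       /* hypothetical arrays, for the test  */
volatile long junkB[1000000];

void storeValues()
{
    /* zero can be stored straight from the hard-wired zero register */
    junkA[0] = 0;
    /* an arbitrary constant has to be built in a register first
       (lui/addi or similar) before it can be stored */
    junkB[0] = 123456789;
}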
Funny, I'm noting a lot of these ideas in PASM. Or there are parallels at the least. And we made some of the same arguments early on too, particularly those comparing 8 bitters and their often very tight code.
Anyone remember the complications in the P2 design with pipelining and states? Seems to me, a lot of that could have been much simpler. In RISC V it is. I'm not saying it's better or worse, just that there are some dynamics here that aren't obvious.
But... getting that tight code is often people writing it. A 6809 can do amazing things with a human programmer. It's a whole lot harder to see a compiler do it. Many instructions and modes go unused. Not to mention things like abusing the stack to move memory on the 6809...
And we've found PASM can do a lot, and as we got better at it, the code got tighter too.
A design like this isn't as human friendly, and is actually obtuse at times. But... as we saw with MIPS and ALPHA, compilers can really squeeze a lot out of it!
Isn't this exactly the argument for years! Who wants to write assembly?
Then there is the whole idea of doing something like SPIN. Some planning would result in a mixed model that could be very efficient.
That rather convoluted argument only really applies where a device might "arrange the instructions order ", ie a device with something like
F Standard Extension for Single-Precision Floating-Point
or the
D Standard Extension for Double-Precision Floating-Point
could do that.
I admit that I have no idea what the above means. In any case, I don't see the problem with the reasoning about having a global flags register vs. not having one: A superscalar architecture will get into trouble if there is one. So it's cleaner and simpler to just test explicitly for any flag you need, where and when you need it, instead of having flags set automatically (where a) you would mostly not need them and b) it complicates the hardware architecture. I can't see how it can be done without slowing something down).
It's really just superscalar vs. not superscalar.
and that is certainly nowhere near a P2 COG replacement, or even any minion processor use.
The more I look at RISC-V, the more compromised it appears, for deeply embedded, small core work. Not something Parallax should invest time in, or even worry about affecting their market.
I still don't see that using an explicit compare instruction to determine if there is a carry (or something else) when and where you need it, is much of a compromise. If that is still what the argument is -- I'm not sure anymore..
(Anyway, as PIC32 is also MIPS the same will apply there. MIPS doesn't seem so different from RISC-V in the general case either. But I'm not finished with studying the ISA yet. I'll have to study alternatives anyway, for reasons mentioned in another thread.)
What do you mean my "duplicate...16 bytes"? The function may well be in lined but it still has to be there as a callable function. It is globally available. If I make it local by adding the "static" attribute to it then it is omitted from the binary.
There is no "hint of the CALL overheads lurking" anywhere as far as I can tell. Rather than "hungry" it's looking very lean to me. Especially compared to x86 or prop-gcc LMM mode.
Err, unless their optimisation is random/broken, they have chosen NOT to call, but instead to inline 16 bytes that is only a call away?!
Maybe I am reading too much into what their tools do, but a sane tool would NOT do that, if the call was smaller. Size matters.
The fact that you see no calls is good. No calls is quicker In lining is generally considered a good thing. If you don't want it it's easily disabled.
hehe, not when the objective is to benchmark calling overhead.
Everyone knows inline is good for very small functions, especially ones called only once. That is not testing the core, that is avoiding the problem.
Now, you are right, "Memory = valuable die area at 180nm." RISCV is unlikely to achieve code density of Propeller COG code what with the Props flags, conditional execution, bit testing and twiddling instructions etc etc. But then the Prop sacrifices all hope of running bigger programs efficiently.
The Prop 2, or Prop 1 ?
I could see a RISC-V meshing nicely with a Spansion HyperRAM, where you get 64Mb in a 5mm package on a 12 pin interface. Code size becomes mostly a 'don't care' in that case.
A small, lockable cache could be used for cases where speed matters most.
Users of such a system would expect interrupts, and that seems to be another 'tbf extension'....

Trying to tease @Heater.? Enjoy!
Mike
Heater, maybe you can compile something simple that gets out of zero land. It is a special case, r0 always is zero. Other values, and sums would show the extra coupla instructions needed to handle values.
LOL, good idea. Sounds like what's really needed is ... (wait for it)..... an FFT comparison.
Good grief koehler, you know me too well. Give me a moment already.
However I suspect the test that would make jmg very happy is probably something involving lots of fiddly bit twiddling and bit testing. But then there is no way to write that nicely in C so it's hard to make a comparison.
Clearly neither the RISC V nor any other "normal" instruction set architecture is going to beat PASM in the confines of a COG. Effectively COG code has all its data in its unique and abnormally huge register file (496 registers) all the time, so there is never any need to load and store data or save/restore registers to a stack.
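For the record, the C idioms for that sort of thing are just masks and shifts (a generic example of mine, nothing Prop-specific), which is exactly the clunkiness being complained about compared with PASM's native bit instructions:

#include <stdint.h>

uint32_t twiddle(uint32_t v)
{
    v |=  (1u << 3);                /* set bit 3                          */
    v &= ~(1u << 7);                /* clear bit 7                        */
    v ^=  (1u << 5);                /* toggle bit 5                       */
    if (v & (1u << 12))             /* test bit 12                        */
        v = (v >> 4) | (v << 28);   /* rotate right by 4 (32-bit)         */
    return v;
}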
Here are the code sizes for the various optimization levels when compiling that original test. The uncalled function is 4 instructions for -O3 and -Os.
..unless their optimisation is random/broken, they have chosen NOT to call, but instead in inline 16bytes that is only a call away?!
Overall I'd say the optimizer was doing a great job. I ask for small code, I get code that is less than half the unoptimized size. I ask for fast code and, well, I have no idea, I don't have a machine to run it on.
Now, you might be right in saying that the optimize for size is a bit broken. It makes slightly bigger code than optimize for speed!
Maybe I am reading too much into what their tools do, but a sane tool would NOT do that, if the call was smaller. Size matters.
You might also be right in saying it could have done better by just calling the 4 instruction function. Well, if I disable inlining (-fno-inline-functions) I get the fourth result in the table above, it's bigger. The compiler made the right choice to inline.
, not when the objective is to benchmark calling overhead.
So far we are not building a calling overhead benchmark. I'm only poking around to see what the ISA looks like and what the compiler does with it.
Users of such a system would expect interrupts, and that seems to be another ' tbf extension' .
Be sure RISC V has interrupts. To be clear, it has exactly one specified in the base level (user mode) instruction set: a timer interrupt.
Interrupts are horrible things that application code never sees. Interrupts and hardware interfaces are down there in the OS. This leads us to the whole world of "privileged" instructions and features that are required to deal with interrupts, input/output, along with handling virtual memory, multiple cores, virtualization, etc etc.
They have pushed all that mess into another layer of specification that is still a work in progress and is being finalized with input from the many stakeholders in the RISCV project. Should be ready about now.
They are listening to compiler writers and operating system builders so as to determine the best features to include and how to interface to them. Unlike all previous processors, where all that is slung together by hardware designers and the people who have to create software for it are left complaining about it for decades afterwards.
Now, one might say: that is all well and good, but I want direct hardware access for my embedded system. We'll see.
Currently if we memory map our I/O we basically have an interrupt-free, COG-like system. Sweet.
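For what it's worth, that interrupt-free style is just a polled loop on a volatile pointer; a sketch with made-up addresses and bit positions (not any real chip's register map):

#include <stdint.h>

#define STATUS_REG  ((volatile uint32_t *)0x40001000u)   /* hypothetical */
#define DATA_REG    ((volatile uint32_t *)0x40001004u)   /* hypothetical */
#define RX_READY    (1u << 0)                            /* hypothetical */

int wait_for_byte(void)
{
    while ((*STATUS_REG & RX_READY) == 0)
        ;                           /* busy-wait, no interrupt anywhere   */
    return (int)(*DATA_REG & 0xff);
}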
A good test for that is the recursive fibonacci, fibo.c in the propgcc demos. I won't bore you with showing the binary produced, just the code size results:
RISCV 32 bit @ -Os: 25 instructions
Propeller COG mode @ -Os: 31 instructions
The Prop ends up using 25% more instructions. What a waste of precious COG space.
Seems suggesting a Propeller like device adopt the RISCV instruction set architecture is not so silly after all.
Results attached for compiler nerds.
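For anyone who hasn't seen it, fibo.c is essentially the textbook doubly recursive Fibonacci; roughly this shape (a sketch, not a copy of the propgcc demo source):

int fibo(int n)
{
    if (n < 2)
        return n;
    return fibo(n - 1) + fibo(n - 2);
}

It is a nice little test because almost all of it is call overhead, argument passing and stack handling, which is exactly where the instruction sets differ.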
These are the only comparisons we should care about, not how RISC V compares to the x86 ISA or ARM's Thumb-2. What we really need to know is whether it's more efficient than P2 ISA, both in terms of code size and in terms of performance (speed, power consumption, etc).
Regardless of that answer, I think no one expects the P2 to be re-architected to use RISC V. However, it might still prove valuable for Parallax to bring out a Propeller-inspired RISC V chip some time in the (distant) future. After the P2 is released.
I quite agree with you Seairth, ultimately we don't care what x86 and ARM and others do, we care what the P1 does and the P2 will do.
The RISC V is only a description of an instruction set. It says nothing about implementation. Nothing about clocks per instruction or pipelines or out of order execution or caches or whatever a CPU builder might do. So it's kind of a fool's game to be making comparisons with the Prop. But I find it interesting how the choice of instructions necessarily has a big effect on code size no matter how the chip implements it. One might say that code size is irrelevant today when we have such huge and cheap RAM spaces available, but I think it is still critical in "normal" processors so as to relieve stress on the high speed caches. Big code means more cache misses and huge slowdowns. Bigger caches mean more silicon wasted not doing actual useful work and more power consumption.
Certainly I don't seriously suggest Chip throw the Propeller design away and start again with RISCV, or anything else. Not only is it way too late to think of that but we need some measure of compatibility with all that existing P1 PASM code out there!
Basically I'm just idling the time away playing with the idea whilst we are waiting on the P2, after all the question here was: "Is it time to re-examine the P2 requirements"
Oh, yeah, if I did not say it already my answer to that question is "No! Just get me the P2 chip already!"
Even though it's unlikely and maybe not even desirable to replace the P2 COG with a RISC V core, it might be interesting to consider adding a RISC V core to a P2 chip and then remove all of the tricky hub execution logic from the P2 COG. The RISC V could run the "application code" and the COG could do what it's best at doing. Using a RISC V core would make more sense than adding an ARM core since there wouldn't be any licensing issues or fees associated with RISC V as I understand it. Still probably won't happen though.
That sounds a lot like the Sitara ARM with PRUs (COGs); TI already has one with 4 PRUs, or some of the industrial strength MCU offerings from Freescale with eTPUs.
Yes, yes, yes. That would be wonderful.
It rapidly gets out of hand though as such a big honking Linux running subsystem will need USB and Ethernet and ... and whatever other things are expected of such a thing.
I forgot to mention, PASM is the easiest assembly language ever to code by hand. Especially as it's nicely integrated with Spin source code. I think this is a very valuable thing from the point of view of the educational target market Parallax has in mind. I often think CS and EE undergraduate courses in programming should use a Propeller rather than some stupid home made instruction set and simulator as they often do.