heater said...
Besides, cogs are precious. What's a few K of HUB RAM?
Yeah, I understand, and it's good motivation to provide such emulation.
People really need to learn how to code without floating point though.
Still, some things have been expected to require a co-processor :)
I was always annoyed that the Spin Floating point thingy requires 2 COGs.
The SpinLMM demo program already contained all of the components that were needed, so I just had to convert the Euler C program to Spin to implement it. The program uses 3,268 bytes, which includes the floating point routines, the serial output, the float-to-string routines, dbprintf and the SpinLMM interpreter.
The top object is shown below; it is 212 bytes, or 146 bytes without the print string.
Dave
CON
  _clkmode = xtal1 + pll16x
  _xinfreq = 5_000_000

OBJ
  lmm  : "SpinLMM"
  ser  : "HalfDuplexSerial"
  f    : "float_lmm"
  fstr : "floatstr"

DAT
terms   long    1000
es      long    0

PUB main | cycles, outstr[20]
  waitcnt(clkfreq + cnt)
  lmm.start
  cycles := cnt
  es := euler_series(terms)
  cycles := cnt - cycles
  fstr.PutFloatF(@outstr, es, -1, -1)
  ser.dbprintf3(string("\nThe first %d terms of the Euler series sum to %s in %d cycles\n"), terms, @outstr, cycles)

PUB euler_term(term) | val
  val := f.fdiv(1.0, f.fmul(f.ffloat(term), f.ffloat(term)))
  return val

PUB euler_series(nterms) | term
  repeat term from 1 to nterms
    result := f.fadd(result, euler_term(term))
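(The Euler C program itself isn't reproduced in the thread; here is a minimal sketch of what it presumably looks like, with the 1000-term count taken from the Spin version above. The series 1/n^2 converges toward pi^2/6.)

#include <stdio.h>

/* Hypothetical reconstruction of the Euler benchmark: sum 1/n^2
   for n = 1..nterms. */
static float euler_term(int term)
{
    return 1.0f / ((float)term * (float)term);
}

static float euler_series(int nterms)
{
    float sum = 0.0f;
    for (int term = 1; term <= nterms; term++)
        sum += euler_term(term);
    return sum;
}

int main(void)
{
    printf("The first %d terms of the Euler series sum to %f\n",
           1000, euler_series(1000));
    return 0;
}

With 1000 terms this sums to about 1.6439, consistent with the 1.643935 Dave reports further down.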
Jazzed: Horses for courses, I guess. Sometimes there may be COGs free for FP when you need the speed; sometimes there may be RAM free for FP and you need the COG speed elsewhere.
Possibly, maybe, one day ZOG will get an FP coprocessor; unlikely, unless I can borrow one that won't take too much work to convert to a C/C++ interface.
Never felt the need for FP in embedded apps myself. I always quote my old boss: "If you think you need floating point then you don't understand the problem". But as FP is a C language feature we should give it a go. I have not even checked that ZOG's FP works yet :)
Dave: That's amazing! What result did it spit out?
Damn ugly code though, I would not want to have to do a lot of FP code in that :)
I hope to have a time for ZOG's Euler series tomorrow.
I just realized how easy it is to get the ZPU library's stdio to work through my C++ FullDuplexSerial. Just write inbyte and outbyte routines in the app. They will override the C run time versions, which have "weak linkage". In those overrides just use FDX.tx() and FDX.rx(). This way console I/O can be redirected to any place the application desires.
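(A sketch of that override, assuming the ZPU C run time declares inbyte/outbyte as weak symbols with these int-based signatures; fdx_tx and fdx_rx are hypothetical C-linkage hooks standing in for the FDX.tx() and FDX.rx() methods.)

/* Hypothetical C-linkage hooks into the C++ FullDuplexSerial object. */
extern "C" void fdx_tx(int c);
extern "C" int  fdx_rx(void);

/* These definitions override the weakly-linked versions in the ZPU
   C run time, so all stdio console I/O flows through FullDuplexSerial. */
extern "C" void outbyte(int c)
{
    fdx_tx(c);
}

extern "C" int inbyte(void)
{
    return fdx_rx();
}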
With that working I can run virgin, unhacked original benchmark codes. Well apart from the fact that printf won't fit in HUB.
Is anyone keeping a table of results here?
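We overloaded getch and putch for the ICC compiler for various console I/O.
Don't even ask me to do all this benchmarking for the Propeller JVM :) That would be useful in time though.
--Steve
(OT: Any luck getting ZOG XMM working again? An update in the ZOG thread would be useful, thanks.)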
Results for euler:
binary file size: 3816 bytes (no printing) or 17k (printing!)
code size: 288 bytes
library size: 0 (Catalina has floating point built-in to the kernel - NO EXTRA COGS REQUIRED).
execution time: 0.057 seconds
The result: Catalina comes in UNDER 2x the size of SpinLMM and 8x the speed.
I hereby declare Catalina the undisputed champion (at least for anything requiring floating point) !!!!
@heater, the result I got was 1.643935, which is the same result I got running on a PC.
@Ross, those are very impressive results. It pays to have the floating point routines running as native PASM code in the same cog. Overall, it seems like the Catalina compiler is the right approach as long as the code fits in 32K.
Thanks. Good floating point performance was one of Catalina's original design criteria.
However, there's more - executing from XMM RAM, Catalina runs the same program in 0.34 seconds - and this is on a DRACBLADE, which has quite slow XMM RAM.
So while I agree about Catalina being best suited for cases where the code fits into 32k, if you have XMM RAM available then Catalina can execute floating point code from XMM RAM faster than SPIN/LMM can execute it from Hub RAM!
I've not done the timing on a RamBlade yet (which has faster XMM RAM), but now that we have benchmarks, I think I will be able to show that Catalina can execute just about any code from XMM RAM faster than a byte coded solution can execute from Hub RAM.
RossH said...
I think I will be able to show that Catalina can execute just about any code from XMM RAM faster than a byte coded solution can execute from Hub RAM.
That's pretty exciting! And some day that RAM will be built-in. Until then we hack and hack and hack.
RossH: Fantastic result there. The fact that you are getting floating point with NO code size overhead (I bet that Euler function is pretty much the same size if you change all the floats to ints) is impressive enough. The speed is awesome; I guess normal Spin + FP object is not that quick, and wastes a COG or two as well.
I'm not sure about this EXT RAM business. Basically Catalina on TriBlade #2 / RamBlade is just C on an MCU with no pins available and idle COGS that can't be used. So it is a poor choice compared to many other MCUs. BUT on DracBlade style boards, which may be slower, there is still a lot of Prop goodness available to be used: pins, cogs, video etc.
Jazzed:
A Java comparison would be interesting, given that ZOG is bytecode based as well.
Zog XMM is a bit of a disaster at the moment. Ever since I separated the ZPU interpreter into its own module (so that its PASM blob is independent of Spin) and redirected the I/O through a mailbox to FullDuplexSerial (rather than relying on the starter COG to do I/O), the XMM version has been broken.
I'm also getting rid of the idea of ZOG having memory mapped I/O, like UARTs, which then has to be trapped and handled somewhere in the interpreter. Zog will now talk to peripherals through memory as is done for normal Spin objects. Or it will use SYSCALL, which gets trapped and handled by a starter COG if there is one. Now that I know how to redirect stuff, this becomes a neat solution for running a ZOG in HUB or XMM which can be single-step debugged from a parent COG.
Also I was kind of waiting on Bill pushing out a new VMCog.
Anyway Zog XMM is returning soon.
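RossH: Typical Zog behaviour! He emerges from his gigantic block of ice, scares everyone, then retreats and freezes himself again.
www.marvunapp.com/Appendix3/zogjim.htm
heater: But, he only emerges from his block of ice when someone disturbs him. Quite harmless really.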
I've been kind of busy working on a consulting gig and fighting with the new PropCade (and new LANuSD boards) - but I got a new shipment of 74HC138's in, so I can replace the (suspect) one in PropCade, and hopefully resume on VMCOG today.
On PropCade, all the circuitry traces out correctly, and it passes the fill test - but fails the "heater" test miserably. I've managed to trace it down to the SPI RAMs not getting initialized into "sequential" mode... the problem is I don't see why they are not getting initialized.
As far as LANuSD goes, the uSD part works great; the LAN part blinks the LEDs on the FastJack, but does not respond. I've traced out all the circuitry, and it looks fine - the only change from the last time I used the Enc28j60 is a change on the power-on reset (as I recall) - so that is what I am attacking next.
Yesterday I ordered a nice portable dual channel 100MHz bandwidth scope with 7" 800x480 display - it samples at up to 1Gs/sec, and has a 4M sample buffer :) this should help me find higher frequency analog issues should I have any, and let me carefully watch the rise/fall times for problems as outlined above. I should get it in about 10 days... should be WAY better than my old analog scopes, or even the nice new shiny PropScope Ken was kind enough to send me :)
a) I start with a most basic program that can output something via FullDuplexSerial. Its Zog binary is 3980 bytes.
b) I add three float variables and one line of code that applies add, subtract, multiply and divide to them. Its binary is... wait for it... 17736 bytes! That's nearly 14K for one line of code!
c) I look in the Catalina kernel's PASM and weep. RossH supports FP add, sub, mul, div with 258 lines of code, including comments.
How long does it take Zog to run the euler_series test? I'll let you know when it finishes :)
Now, how can I get that PASM FP support into Zog's interpreter? For sure, I don't think there is enough room. And how to hack the ZPU libs to use it? Could be done via the ZPU's SYSCALL instruction. Or perhaps run them from an LMM engine within Zog. Hmmm....
It seems that a program could be optimized by selecting the appropriate version of the LMM interpreter. Programs that do a lot of floating point could be compiled with a library and an interpreter that favors floating point operations. An integer DSP program could use an interpreter that favors integer MACs and other DSP operations. Different interpreters and library functions could be developed to handle various types of applications.
I was quite happy to have Zog go without floating point, having argued for years that little MCU applications rarely need it. But oh no, I had to put up a floating point benchmark. Now Ross can't stop laughing long enough to post here anymore :)
I can't weasel out of it either, having had the debate with Ross about what is in the "language" and what is the "library" baggage. Clearly float is in the language syntax/semantics.
There are 75 or so longs left in the Zog interpreter Cog. I'm prepared to try this, as an experiment:
a) Put the smallest most simple LMM kernel loop into Zog. Or perhaps an unrolled loop if I'm feeling generous with LONGs.
b) Arrange that the ZPU SYSCALL instruction can be used to get to the LMM float operations, with various syscall id numbers.
Now, this already takes a bit of juggling. You see currently Zog's dispatch table is floating around in HUB memory. You have to tell Zog where it is through a PAR block at start up. This is so that we can move it out of the way of the bulk of HUB RAM when running.
Similarly, any LMM code will have tenuous linkage with the interpreter. Again the PAR block will have to tell Zog where to find the float code. Then the float LMM code has to exit to a known, fixed location within COG when it is done.
All this only effectively gives Zog's C access to float operations through a function interface. Not actually via "+", "-", "*", and "/" operators. That requires hacking the ZPU C runtime support library to use the new syscalls instead of the current routines. That may or may not ever happen.
Still, we can define a wrapper class in C++ and override the arithmetic operators for it to use the syscalls. Then things will look OK :)
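(A sketch of that wrapper idea, shown here wrapping a float since the syscalls in question are the float operations; the zpu_syscall_f* names are hypothetical stand-ins for whatever the SYSCALL shims end up being called.)

extern "C" float zpu_syscall_fadd(float a, float b);
extern "C" float zpu_syscall_fsub(float a, float b);
extern "C" float zpu_syscall_fmul(float a, float b);
extern "C" float zpu_syscall_fdiv(float a, float b);

class Float {
    float v;
public:
    Float(float x = 0.0f) : v(x) {}
    // Each operator traps into the interpreter's LMM float code.
    Float operator+(const Float &o) const { return Float(zpu_syscall_fadd(v, o.v)); }
    Float operator-(const Float &o) const { return Float(zpu_syscall_fsub(v, o.v)); }
    Float operator*(const Float &o) const { return Float(zpu_syscall_fmul(v, o.v)); }
    Float operator/(const Float &o) const { return Float(zpu_syscall_fdiv(v, o.v)); }
    operator float() const { return v; }
};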
Bill: Any ideas about making those LMM float routines relocatable? Or do I have to do it as I described?
Two solutions exist:
1) Since the float routines will be short, simply use "branch" instructions, and not jumps - i.e. 'sub pc,#(inst+1)*4' to go backwards, 'add pc,#(inst-1)*4' to go forwards, going up to 128 instructions back, or 127 forward.
2) use LMM code for the 'setup' of the routines, which can use 'branching', but put the loops into FCACHE blocks - whose address you will know, due to knowing where your FCACHE buffer is
Note, you would not lose much performance by moving integer multiply/divide/remainder out to LMM code (that FCACHEs loops).
Also, note that 'branch' instructions can be used to implement subroutines:

mov savepc, pc      ' save the LMM PC (it now points at the branch below)
add pc, #offset     ' relative branch into the subroutine

when the routine wants to return:

add savepc, #4      ' step over the branch to form the return address
mov pc, savepc      ' resume the caller

Of course, you would have to manage your own nesting of relative subroutine calls - however if subroutines are single level, there is no need to keep a stack.
Maybe you could use the LMM interpreter from SpinLMM ( http://obex.parallax.com/objects/635/ ). It only uses 51 longs, and 16 of the longs are used for the FCACHE area. With 70 longs you could add more space to the FCACHE area, or you could add a few more pseudo-ops. This would allow you to run the floating point LMM code that I used with SpinLMM.
As Bill said, you can make jumps in the LMM code position-independent by adding or subtracting an offset to/from the program counter.
Dave
John Abshier: I guess that the 14K extra code includes all the floating point math functions, not just the ones you used. Could you get a smarter linker to only link the functions used?
Sorry, been busy working on my optimizer. However, I've just spent some time re-running all the benchmarks so far but from XMM RAM. Here are the results:
For the integer benchmarks (Dhrystone 1.1, xxtea), Catalina runs from XMM RAM about 6 times slower than from Hub RAM. Since Catalina was 4 times faster than Zog and CSPIN, this means Catalina executing C programs from XMM RAM is about 50% slower than SPIN executing from Hub RAM on these compute-intensive integer programs.
However, for the floating point benchmark (euler), Catalina runs from XMM RAM only 3 times slower than from Hub RAM. Since Catalina was 8 times faster than SpinLMM, this means Catalina runs compute-intensive floating point programs around 2.5x faster from XMM RAM than SpinLMM does from Hub RAM.
I think that for more general purpose programs (i.e. those that include a mix of integer, floating point, plus other language features that are not included in these particular benchmarks), it is quite possible that Catalina may come in around 2.5x the code size of the equivalent byte code program, and be able to execute the programs from XMM RAM at around the same speed as SPIN byte codes from Hub RAM.
An example that might show this would be the Whetstone benchmark, so I have included a copy, as well as Catalina's results when running from XMM (below). I don't think the results are particularly impressive, but I think it would be interesting to compare them with CSPIN or Zog running the same code from Hub RAM.
The Whetstone program I used is included - I had to make a minor source code change to eliminate some DOS specific function calls (nothing to do with the benchmark itself - they were to read characters and to fetch the date).
Here are the results when all floating point functions (sin, cos, sqrt, exp, etc) are implemented in C (i.e. using the Catalina -lm option):
----------------- ----------------------------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
Whetstone Single Precision Benchmark in C/C++
Month run 0/0
PC model RamBlade
CPU Parallax Propeller 1
Clock MHz 6.25Mhz
Cache NIL
H/W Options NIL
OS/DOS NIL
Compiler Catalina 2.6
OptLevel -lcx -lm -x2 -O3 -D RAMBLADE -M256k -D CLOCK
Run by Ross Higson
From Sydney, Australia
E-mail address ross@thevastydeep.com
Loop content        Result                 MFLOPS    MOPS    Seconds
N1 floating point   -1.12477111816406250    0.042             0.461
N2 floating point   -1.12276840209960938    0.047             2.840
N3 if then else      1.00000000000000000              0.089    1.166
N4 fixed point      12.00000000000000000              0.083    3.777
N5 sin,cos etc.      0.49913258552551270              0.002   42.593
N6 floating point    0.99999990463256836    0.024            22.440
N7 assignments       3.00000000000000000              0.033    5.638
N8 exp,sqrt etc.     0.75111975669860840              0.001   29.922
MWIPS                                       0.092           108.837
Results to load to spreadsheet MWIPS Mflops1 Mflops2 Mflops3 Cosmops Expmops Fixpmops Ifmops Eqmops
Results to load to spreadsheet 0.092 0.042 0.047 0.024 0.002 0.001 0.083 0.089 0.033
Here are the results when using two additional cogs for the floating point functions (sin, cos, sqrt, exp, etc.) implemented in PASM (i.e. using the Catalina -lmb option):
----------------- ----------------------------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
Whetstone Single Precision Benchmark in C/C++
Month run 0/0
PC model RamBlade
CPU Parallax Propeller 1
Clock MHz 6.25Mhz
Cache NIL
H/W Options NIL
OS/DOS NIL
Compiler Catalina 2.6
OptLevel -lcx -lmb -x2 -O3 -D RAMBLADE -M256k -D CLOCK
Run by Ross Higson
From Sydney, Australia
E-mail address ross@thevastydeep.com
Loop content        Result                 MFLOPS    MOPS    Seconds
N1 floating point   -1.12477111816406250    0.042             0.461
N2 floating point   -1.12276840209960938    0.047             2.840
N3 if then else      1.00000000000000000              0.089    1.165
N4 fixed point      12.00000000000000000              0.083    3.777
N5 sin,cos etc.      0.50224490165710449              0.010    8.130
N6 floating point    0.99999990463256836    0.024            22.439
N7 assignments       3.00000000000000000              0.033    5.638
N8 exp,sqrt etc.     0.74965400695800781              0.006    6.183
MWIPS                                       0.197            50.633
Results to load to spreadsheet MWIPS Mflops1 Mflops2 Mflops3 Cosmops Expmops Fixpmops Ifmops Eqmops
Results to load to spreadsheet 0.197 0.042 0.047 0.024 0.010 0.006 0.083 0.089 0.033
For MWIPS (only) this puts the Prop about on par with a PDP 11/34 (with FP), or slightly below a MicroVax 1.
It is not useful to compare XMM binary file sizes with Hub binary file sizes because the format is completely different, but the binary is quite large because the program also includes writing the results to a file, and therefore requires full Catalina file system support as well as maths library support - this easily blows the 32k hub limit for LMM programs. The total code size for Catalina is 62,896 bytes, with the Whetstone code itself (excluding all file and maths library functions) being 11,700 bytes.
It is therefore possible that Zog or CSPIN could fit the entire program into Hub RAM.
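potatohead: @Heater -- you are a good quality human being. Know that, and keep pounding. I seriously enjoy your perspective.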
Yep, the relative jumps in LMM were clear to me already; that's what ZiCog does in its LMM.
I was worried about the fact that I want to move the LMM code, say to the top of HUB with the dispatch table, so the start of LMM has to be passed to Zog through PAR so he can find it. Then he has to know the offsets of the routines within that LMM block.
Well, I guess that's not a problem, I'll just put a jump table at the beginning of the LMM block and get to the float routines through that. Then if the code changes for some reason everything will still work.
Then I thought there was a problem getting out of LMM back to the interpreter. This morning that does not seem to be an issue :)
I'm inclined to skip FCACHE, at least initially, to keep the thing as small as possible. I like to have free longs in COG. If we ever get to say, "Zog is done", then perhaps any optimisation that needs space can be applied.
I like the LMM subroutine idea.
@Dave:
I'll have a look at the SpinLMM float code. 51 LONGS for the LMM interpreter seems a lot to me, see above comments.
@John Abshier:
Not so, that 14K is what gets pulled in for the +, -, *, / operators. I played with this a bit: compile using just "+", then add "-", then add "*", etc. The binary gets progressively bigger, from 11428 bytes with just "+" to 17736 bytes with all four.
I don't think there are many linkers smarter than the GCC linker out there.
@RossH:
That's a lot of testing you have been doing. Those XMM results are impressive. A few details are missing from the results:
Which XMM board were you using: TriBlade, DracBlade or ...?
Are you keeping data and stack in HUB?
Why is the binary format different for XMM? Is that because data is still in HUB?
Currently Zog has been running the exact same binary image from HUB or XMM. Stack and data move to XMM as well. Shall we say "slow".
I don't think Zog will fit Whetstone in HUB. Not until I've done something with the float in LMM and then modified the float libs accordingly. The latter could take a long time to happen.
@potatohead:
Wow, thank you. I don't know what to say. For once.
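So we have a PDP 11/34 in a match box, not just a CP/M machine :)
Whetstone won't go here, no sin, cos, atan etc :(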
All the XMM benchmark testing is on the RamBlade. The memory mode I used is -x2 (which is data and stack in Hub RAM and code in XMM RAM). Catalina also has a -x5 option (data and code in XMM and stack in hub RAM) but that's slightly slower. I can't see the point of ever putting the stack in XMM RAM - it would be way too slow.
The XMM file format is around 32k larger than the equivalent Hub binary format - the first 32k is used for the kernel, drivers and plugins, and all 32k is reserved even if it is mostly empty - the actual program starts after the first 32k. So as soon as you compile with -x2 or -x5 the files immediately get substantially larger even if the code size hasn't changed at all.
Ross.
P.S. The Catalina Optimizer replaces all LMM long jumps with relative jumps throughout the entire program (including library functions) - that's its single biggest code size savings. The next biggest saving often comes from the automatic 'inlining' of function calls, and then there are some other 'tidy up' optimizations to eliminate unnecessary instructions.
heater said...
I'll have a look at the SpinLMM float code. 51 LONGS for the LMM interpreter seems a lot to me, see above comments.
The 51 longs include a 16-long FCACHE space plus 8 general purpose registers plus a few pseudo-ops. You could implement a basic LMM interpreter loop in 4 longs. You will need a program counter, but you might be able to use a register within your existing interpreter for that.
Dave
Yep, LMM in a tad more than 4 LONGs sounds like just the ticket.
That's what I have in the latest ZiCog Z80 emulator. (Yes that project is still alive.)
Thing is, those float functions take 180 longs or so. That's 720 bytes eaten out of HUB RAM, plus 128 bytes already used by the Zog dispatch table. Call it a kilobyte. I'll have to make float support optional.
Just discovered that the ZPU libmath does indeed have sin, cos, tan etc. I just had my "-lm" option at the wrong end of the link command in the Makefile, so it did not look for them...
Now, I also discovered that the Whetstone benchmark code uses "sin", "cos", "tan" etc. Turns out that these are defined as "double sin(double x);" and so use double-precision floats, 8 bytes for the ZPU GCC. If one wants single-precision floats one has to use "sinf", "cosf", "tanf" etc. These operate on 4-byte floats and give a slightly smaller executable. Presumably a lot faster too.
I added some macros to Whetstone like "#define sin sinf" to get the single-precision versions, as sketched below. The Whetstone binary is 75520 bytes for doubles and 72192 bytes for single precision.
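(A sketch of that shim, assuming the usual single-precision names from math.h; the defines must come after the include so the library declarations aren't mangled.)

#include <math.h>

/* Map the double-precision names used by Whetstone onto the
   single-precision variants so everything stays in 4-byte floats. */
#define sin  sinf
#define cos  cosf
#define tan  tanf
#define atan atanf
#define exp  expf
#define log  logf
#define sqrt sqrtf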
Still no chance to run this from HUB RAM. Even if all the printing was removed it would still be over 32K.
RossH: What sizes is Catalina getting for sin, cos, tan etc.? Does it support double-precision 8-byte floats?
Now I have a challenge on my hands to get Zog to do this from HUB RAM not to mention do it with any usable speed.
The Zylin guys were discussing adding float support to the ZPU instruction set somehow a while back. I should try to find out what they are up to.