Catalina 2.6 - a FREE C compiler for the Propeller - The Final Frontier! - Page 2 — Parallax Forums

Catalina 2.6 - a FREE C compiler for the Propeller - The Final Frontier!


Comments

  • RossHRossH Posts: 5,573
    edited 2009-10-02 02:49
    @Baggers,

    I've just completely deleted and reinstalled MinGW/GCC and then recompiled LCC with it - using both the makefile included in Catalina and Hippy's batchfile. I still can't reproduce your problem. Here are the steps I used:

    - Download the MinGW installer (available here). I used version 5.1.6. Run the installer.

    - If it tells you there is a later version of the installer, and asks if it should upgrade, say no (it doesn't seem to work, and is not necessary anyway).

    - When it asks whether to install the previous/current/candidate version, select 'current'.

    - When it asks you to choose what to install (in addition to the base stuff), add in the Make utility (you will need it for other things even if you use Hippy's script to build LCC).

    - Next, install MSYS (available here) - this adds some necessary common utilities, like 'cp'.

    - Set up your paths appropriately - this depends on where you installed the above packages (e.g. on my system I say SET PATH=C:\Mingw\bin;C:\MinGW\msys\1.0\bin;%PATH%)

    - compile LCC.

    Note: I'm not sure why, but the version of 'cp' I end up with doing things this way seems picky about '\' vs '/' in makefiles. So the file makefile.mgw in source/catalina now needs to be edited to work properly. This may also apply to other files.

    If this doesn't solve your problem, the only thing I can think of is that there is some conflict between the MinGW version of GCC and another C compiler on your system - perhaps the compiler is picking up incompatible include files or libraries. Check your environment for variables like INCLUDE or LIBRARY_PATH etc.


    Ross.
  • BaggersBaggers Posts: 3,019
    edited 2009-10-02 21:32
    YAY cheers Ross, and Hippy,
    basically I forgot to add MSYS, and also I installed to C:\Program Files\Catalina\MinGW and I'm guessing it didn't like the space in the filename on install
    anyway, now it compiles the source, well, I assume it does, as it doesn't output anything or any errors, so I'm guessing that's a good thing :D lol
    will have a play tomorrow.
    Cheers again for the help guys :)

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    http://www.propgfx.co.uk/forum/ home of the PropGFX Lite
  • RossHRossH Posts: 5,573
    edited 2009-10-02 22:59
    @Baggers,

    Good stuff. You will probably also have problems building the library, because I've realized that my Windows environment and build instructions were all a bit flaky - I'm working on updating them for release 2.1 (which also fixes a kernel bug). This release should be out today or tomorrow.

    I was using MinGW and UnixUtils, but I've decided to standardize on MinGW and MSYS. And I'm beginning to think Hippy's right - Windows and makefiles just don't mix at all well!

    Ross.
  • BaggersBaggers Posts: 3,019
    edited 2009-10-03 13:30
    cool, will look forward to it :)

  • RossHRossH Posts: 5,573
    edited 2009-10-04 06:34
    @All

    I've just updated the sources and binaries for Catalina 2.1. There are no functional changes in this release. The only changes are as follows:
    • Fixed a problem with initialized structures that I first noticed when compiling JZIP (which now compiles correctly with this release). This may have also affected other programs.
    • Cleaned up the build scripts and makefiles for Windows to ensure that they work correctly under MinGW/MSYS - see the documentation for more details.
    • Fixed up some out of date documentation in the reference manual, and added a note about C symbols vs Catalina symbols.
    Any problems please let me know.

    Ross.
  • hippyhippy Posts: 1,981
    edited 2009-10-04 21:09
    @ Ross : Not sure if significant, but when building 2.1 with the same MAKE.BAT as previous, there's a discrepancy with the original rcc.exe as installed from your .zip files ...

    2009-10-04 15:48 1,357,329 rcc.exe.original
    2009-10-04 22:05 1,357,841 rcc.exe
  • RossHRossH Posts: 5,573
    edited 2009-10-05 01:20
    @Hippy,

    Thanks. I'll investigate.

    Ross.
  • RossHRossH Posts: 5,573
    edited 2009-10-05 03:59
    @Hippy,

    I don't think the differences are significant - the size difference is due to the embedded paths in the symbol information, because your batch file uses different (but equivalent) path names (e.g. 'src' vs './src'). If you strip the symbol information out of the executables (i.e. 'strip rcc.exe') you end up with the binaries being exactly the same size - the remaining differences are probably just due to different ordering of objects within the executable.

    Ross.
  • RossHRossH Posts: 5,573
    edited 2009-10-06 12:48
    Another Catalina Curiosity ... self-hosted Pascal development on the Propeller!

    This is the P4 Pascal Compiler and Interpreter compiled for the Propeller using Catalina.

    The original, and more documentation, can be found here

    I used the C sources (not the Pascal sources) for both the compiler and the interpreter. Some trivial changes to the C source were required to make it ANSI C compliant.

    I have also added a 'wrapper' program to the compiler that allows the input of a file name on the Propeller (since you cannot enter any command line parameters on the Prop - something I will rectify soon).

    This program can only be run on the TriBladeProp with 1Mb SRAM installed (the Hybrid with the HX512 does not have enough RAM). A makefile ('Makefile.Catalina') is provided.

    To compile the programs:
    set CATALINA_DEFINE=TRIBLADEPROP CPU_2 PC
    make -f Makefile.Catalina clean
    make -f Makefile.Catalina pcom.binary pint.binary
    
    As implied above by the setting of the 'CATALINA_DEFINE' environment variable, the program runs on Blade #2, and uses a PC Terminal Emulator for user input and output.

    I have included a trivial 'sample.pas' program, so after building the programs, copy the following to a FAT16 SD Card (and rename the files as shown):
    pcom.binary => PCOM.BIN
    pint.binary => PINT.BIN
    sample.pas  => SAMPLE.PAS
    
    The programs must be loaded and executed using the Catalina Generic Program Loader (included with Catalina).

    To compile a Pascal program, execute the compiler program PCOM.BIN and when it prompts for a file name, enter the name of the program to be compiled (e.g. 'sample.pas'). The compiler will print the file as it compiles it (with any errors noted) and leave the resulting executable (assuming it compiles correctly) in a file called 'prr'.

    To run the compiled program (i.e. 'prr'), execute the interpreter program PINT.BIN. This program requires no parameters - any program executed will take its input from stdin and put its output on stdout. Note that you need to reset the propeller after each execution of the compiler or interpreter.

    The execution speed is a few hundred lines of Pascal per second - not too bad considering this compiler does not generate native PASM - it compiles the Pascal source code into a machine-independent assembly-like intermediate language, which must then be interpreted (well, at least it's slightly faster than Bywater BASIC!!!).

    Enjoy!

    Ross.

    Edit: With the release of Catalina 2.3, all the Catalina Curiosities have been collected into a single zip file - see the second post in this thread.
  • RossHRossH Posts: 5,573
    edited 2009-10-08 10:07
    @all

    Earlier in this thread, Javalin asked for some benchmarks with SPIN, reporting that ICCv7 was about 5x faster. Well, that's about what I might have expected, so I did some benchmarks of Catalina. The trouble is (as I discovered) that it is just sooooo dependent on the benchmark program you choose to use. Coming up with a good benchmark is not as easy as you might think - e.g. any algorithm involving too much multiplication or division (such as Dhrystone) is out, since multiplication and division are so relatively expensive when implemented in software that all you are really comparing is the difference between the implementation of the software multiply and divide subroutines. In the end I just wrote a trivial program that exercises a 'for' loop, a procedure call and a simple 'if' statement involving bit operations, addition and subtraction.

    The results of the first benchmark are as follows:

    On HYBRID:

    SPIN: 20 seconds

    Catalina Tiny mode: 8 seconds
    Catalina Small mode: 52 seconds
    Catalina Large mode: 102 seconds

    On TRIBLADEPROP:

    SPIN: 25 seconds

    Catalina Tiny mode: 11 seconds
    Catalina Small mode: 55 seconds
    Catalina Large mode: 70 seconds

    I think this benchmark gives a fairly realistic comparison, but the results are a little disappointing. As you can see, Catalina's Tiny mode is twice as fast as SPIN on this benchmark - but I was expecting better than that. Catalina's Small XMM mode is about 2-3 times slower than SPIN, and Catalina's Large XMM is about 3-5 times slower than SPIN. Again, I was expecting a slightly better outcome. But, as I was also expecting, the results do highlight how much impact the XMM hardware can have, with the TriBladeProp (at 80MHz) significantly outperforming the Hybrid (at 96MHz) in Catalina Large mode.

    Looks like there is still plenty of optimizing to do on Catalina!

    But it is also clear after playing around a bit that you can get pretty much any result you want by tweaking the benchmark appropriately. For example, Benchmark 2 omits the procedure call - it's essentially just a 'for' loop and a simple addition. Here Catalina is about 5 times faster:

    Benchmark 2:
    =========

    On HYBRID:

    SPIN: 16 seconds
    Catalina Tiny mode: 3 seconds

    Anyone who wants to submit a better benchmark is welcome to do so. Please implement them in both ANSI C and SPIN. I'll run them and report the results.

    Ross.

    Edit: Removed a reference to SPIN not being able to do recursion (which of course it can) - I was thinking of PASM when I wrote that.
  • Cluso99Cluso99 Posts: 18,069
    edited 2009-10-08 10:30
    Ross:

    I am now running the TriBlade with a 6MHz xtal for 96MHz - as soon as I get back to Sydney sometime next week, I can give you / solder in a 6MHz xtal. Sapeiha has had 120MHz running on the TriBlade for >6months.

    It's a shame Catalina will not run in 512KB. My new RamBlade will only have 512KB and its XMM read is 50% faster and write is 33% faster. I am also aiming for 120MHz - Just ordering xtals to 160MHz for testing as I have done a few things to hopefully allow the higher overclocking. RamBlade (prop+sram+uSD) is a cheap add-on to any existing prop, especially the Proto and Demo boards. It also has some other options. ETA 2-3 weeks so expect an announcement of full details shortly. It is fully smt so nice & small & available preassembled.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Links to other interesting threads:

    · Home of the MultiBladeProps: TriBlade, RamBlade, RetroBlade, TwinBlade, SixBlade, website
    · Single Board Computer: 3 Propeller ICs and a TriBladeProp board (ZiCog Z80 Emulator)
    · Prop Tools under Development or Completed (Index)
    · Emulators: Micros eg Altair, and Terminals eg VT100 (Index) ZiCog (Z80), MoCog (6809)
    · Search the Propeller forums (uses advanced Google search)
    My cruising website is: www.bluemagic.biz  MultiBladeProp is: www.bluemagic.biz/cluso.htm
  • RossHRossH Posts: 5,573
    edited 2009-10-08 10:46
    Hi Cluso,

    I have some 6.25MHz crystals that I bought from Bill Henning. I use them on my Hydra (which has a socket for the crystal) but I haven't had the courage yet to attempt de-soldering the existing crystals from the TriBladeProp. But I'm picking up a new pair of glasses tomorrow, so maybe ...
    Ross.

    P.S. Just to clarify for other readers of this thread : Catalina programs work fine in 512Kb - what Cluso is referring to is that 512Kb is not enough RAM to run the compiler itself natively on the Propeller, instead of using it as a cross compiler on a PC - i.e. to do self-hosted Propeller development in C. I don't actually know yet how much RAM is required for that - but at least 2Mb, I think.
  • Cluso99Cluso99 Posts: 18,069
    edited 2009-10-08 11:00
    Ross: I am thinking of making a modified TriBlade/RamBlade driver which does an extremely simple form of virtual memory so that Catalina could run. Would be much slower and a penalty on the write wearing on the SD, but the RamBlade will have a self-contained option (keyboard and TV), so this would be perfect. The driver would be transparent, so no changes to Catalina would be required.

  • RossHRossH Posts: 5,573
    edited 2009-10-08 11:11
    @Cluso,

    Sounds interesting, provided speed is not so important - which I suppose it isn't for a compiler. After all, it doesn't really matter if it takes a minute or two to compile a program instead of a couple of seconds. Might even encourage better programming practice ('you young people today don't know how lucky you are! In MY day ... ')

    But a bigger problem will be fitting the necessary virtual memory driver code into the XMM kernel - unless the driver could run in a separate cog?

    Ross.
  • Cluso99Cluso99 Posts: 18,069
    edited 2009-10-08 11:17
    Ross:

    I forgot. I presume you are not using my driver. But we can still make a simple virtual memory driver in another cog to swap blocks of ram. Speed is not an issue to be able to run it natively for those who want to do this. I am not thinking of anything complex.

  • RossHRossH Posts: 5,573
    edited 2009-10-08 11:25
    @Cluso,

    No, I wrote my own access routines based on your original code, but only including the small part I needed. But I might have a go at a general purpose interface to an XMM driver running in another cog - I meant to do this anyway because I think XMM caching could significantly improve Catalina's performance in some cases (and after the slightly disappointing benchmarking results, I think I need to do some work on the performance aspects of Catalina sooner rather than later).

    It's all a matter of finding the time.

    Ross.
  • Cluso99Cluso99 Posts: 18,069
    edited 2009-10-08 15:25
    It is working, so that is the main thing :) Speedups can come when time permits. You also have to be careful not to take big hits with passing between cogs.

    In ZiCog, we have 2 lots of accesses. One is just fetching and writing bytes and is done within ZiCog. However, all SD and block RAM access is done within a driver in another cog that has been optimised.

  • RossHRossH Posts: 5,573
    edited 2009-10-09 02:41
    @Cluso,

    How do you coordinate accessing the shared XMM hardware from multiple cogs? The hub locks tend to be too slow to be of much practical use - do you have a faster mechanism?

    Ross.
  • heaterheater Posts: 3,370
    edited 2009-10-09 06:36
    RossH: In answer to your question to Cluso. Currently in ZiCog only the PASM Z80 emulator or the Spin peripheral hardware emulator object are executing at any moment. They both wait for each other like co-routines so there is no contention for EXT RAM access.

    Not sure where Cluso is headed with this but I guess all Spin code using RAM should go through a common driver object, probably with locks. Accessing from PASM may always be a bit specialized for a given application.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    For me, the past is not over yet.
  • Cluso99Cluso99 Posts: 18,069
    edited 2009-10-09 08:10
    The driver always leaves the Ram in a specific condition. Likewise the same with ZiCog. So they both can co-exist. The lock mechanism is not used because the drivers always leave the pins in a predetermined position when passing off to another object.

  • RossHRossH Posts: 5,573
    edited 2009-10-09 08:55
    When I get some time I'll have a play with sharing XMM access between two or more cogs. Now that Catalina seems to be virtually complete, this is very much in line with what I originally wanted to do with the Prop anyway.

    Ross.
  • ImageCraftImageCraft Posts: 348
    edited 2009-10-09 10:07
    Ross, I did not look at your benchmark or the internals of Catalina. However, LCC makes lots of assumptions about certain things. We do not use the straight lburg code generator. We have a separate optimization pass on top to get the performance we need. Heck, we can easily get another 20% better performance on ICCv7 for Propeller right now, but I am waiting for the market to catch up. Right at this moment, we are concentrating on the next-gen IDE which will benefit all of our platforms.

    Regardless of my digression, another thing to remember is that certain constructs are just going to be lousy for LMM. I don't know about your implementation, but for us, it drove me nuts that someone proposed Fibonacci as a reasonable benchmark. It's basically just function calls, and LMM+register style is just plain lousy for that; we only got 3x over Spin on that one (I think). So your tests may just be hitting your bad spot as well.

    ICC is about 5-8x faster than Spin on average.
  • RossHRossH Posts: 5,573
    edited 2009-10-09 11:14
    Richard,

    Yes, I think the experience you guys bring to the ICC code generator shows here.

    I agree about Fibonacci being a very poor benchmark, and in hindsight my own benchmark probably just highlights the relatively high overhead of Catalina function calls. Some more work is obviously required there - but to be honest it isn't high on my 'to do' list at this point.

    I also agree about LMM being inherently slow at some things - it's even truer (more true?) of XMM. It all comes down to whether the advantage of being able to run very large C programs 'off the shelf' outweighs the disadvantage that they will run relatively slowly on the Prop.

    I don't really know the answer to that - but I do know that it's a hell of a lot of fun finding out!

    Ross.
  • hippyhippy Posts: 1,981
    edited 2009-10-09 17:11
    @ Ross, Richard : I had the same experience with my own kernels on stack frames and function calls ( using a stack machine there ). I don't think Fibonacci is a fair benchmark per se but function calls are pretty common in most code so I think they have to be included in benchmarking.
  • RossHRossH Posts: 5,573
    edited 2009-10-10 09:16
    @Hippy, @Richard ...

    Yes, I agree that Fibonacci would have to be just about a 'worst case'. But considering how much time I spent on getting register passing working under Catalina, I thought I might give it a spin (so to speak) - so enclosed is a simple Fibonacci benchmark. I think the results are quite interesting. Catalina is still substantially faster than SPIN, but the gap is the narrowest I've seen. SPIN is obviously not too shabby at stack manipulation itself - or perhaps it is just the case that stack manipulation (which always requires hub interaction no matter what language you use) is always going to be the great leveller amongst Propeller languages.

    The results (Hybrid 96MHz):

    SPIN: 23 secs

    Catalina Tiny Mode: 14 sec
    Catalina Small Mode: 60 sec
    Catalina Large Mode: 60 sec

    I hate to think what the results would look like WITHOUT register passing - C might not even beat SPIN!

    Ross.

    Edit: Wrong attachment!
  • hippyhippy Posts: 1,981
    edited 2009-10-10 13:58
    @ RossH : Looking at your LMM kernel functions for PSHM/POPM there's scope for speed improvement there.

    There's no bail-out when the bit mask becomes zero so you cycle through all 24 registers regardless. Detecting a zero bit mask may also save a few cycles in looping.

    Also, the bit mask itself may not be optimal; reordering the registers in cog and consequently their bit position in the bit mask would save 15 times of 'doing nothing' round the loop in this case.

    The jmp's from LMM code to jmp's in kernel to actual handler is 4 cycles of inefficiency - not a lot but it all adds up. It shouldn't be that hard to generate an 'include file' for each kernel which allows the generated code to have jmp's to the actual handler addresses.

    If you want to maintain that memory mapping to 'leave things as they are', make linking etc easier then consider JIT fixups. The first call to one of those jmp's in kernel actually jmp's to a routine which will set the jmp <dst> in LMM to where the handler actually is. That has overhead but the next call into kernel will go directly to the handler. You could add JIT fixup regardless of what Catalina generates; if fixups are done at compile time it just becomes wasted overhead, if not, JIT kicks in. The JIT code could be LMM held itself in hub; even greater overhead on first call but may save on precious cog space.

    As kernels all tend to be a rdlong of LMM opcode, a subsequent rdlong to get an operand for a kernel call I found that time/cycles between the two rdlongs was a critical factor. That may mitigate against removing jmp to jmp handling ( that is, removing it saves nothing ), but there may be scope for gains. Ultimately from one LMM opcode to the next is two, three or more rdlongs and saving in one area may be lost later. This is IMO the limiting factor on kernel speed. You'd have to do a cycle count analysis on each case.

    There's also a natural tendency to put 'jmp #LMM_fetch' at the end of each handler, where it may be better to do a rdlong and then a jmp to execute it. It's not easy working all this out by hand and I expect the most optimal kernels will be program generated. Not saying that would be easy.

    There's also plenty of scope for inlining real PASM rather than using calls to the kernel, either done by compiler or JIT fixups. Don't forget that executing any operand with bits 18-21 clear is a NOP, and 'fall through' will be quicker than executing any kernel call. Your BRxx kernel calls would probably benefit greatly from being conditional "<cond> jmp #BRxx" when they can be.

    Also, where an operand may be 9-bits or less, consider two versions of the kernel handlers, one which loads from the next long and one which loads from a temp register set before the kernel call. You don't have to provide two of everything but in some cases there will be benefits, and this may offset some rdlong-to-next-rdlong limitations.

    For 'brute force' you could JIT fixup the entire program as a separate optimisation pass of the compiler or as run-time pass before execution. That would also allow 'reverse LMM' with a corresponding lower run-time overhead.

    For 'elegance' you can derive the optimal kernel and the optimal register ordering from the actual generated LMM code.

    Obviously a lot of work all-in, but some things will be quite easy to achieve for little effort and even small gains may have huge pay-back overall. You don't have to implement for all targets; LMM first then XMM and EMM later.
  • RossHRossH Posts: 5,573
    edited 2009-10-11 00:25
    @Hippy,

    Thanks for the great suggestions. Some of these I'd already been considering, others I'd never thought of.

    I knew the PSHM/POPM loop was very inefficient - it was a quick hack designed to reduce program space, not improve efficiency. It has stayed in there because my 'todo' list still has quite a few other embarrassingly bad bits of coding that I should fix first.

    I'd also started on some work to eliminate the kernel jump table - for one thing, I could do with the cog space for other purposes. But there are other reasons why I would quite like it to stay - not the least of which is that it really does complicate the compilation process to eliminate it!

    I'll see which of your other suggestions I can include in the next release. Inlining should be pretty straightforward to implement - I may have a go at that. I'd also welcome you (or anyone for that matter) having a go at these ideas - some of them can be done quite independently.

    But overall Catalina appears to be reasonably robust and mature at this point, and I won't be rushing to get a new release out specifically to address performance. Although it's not quite as good as I'd hoped it would be, it actually stacks up pretty well against the competition. As I'd always expected, it turns out to be faster at some things, not so fast at others. The specific things it has turned out to be faster at surprised me a bit - they're not what I would have predicted.

    But the main area of improvement I'll be working on for the next release or two will simply be to reduce the overall code size. There's quite a lot of speed to be gained just by eliminating some of the current inefficiencies in the code generation - this is especially true for XMM programs.

    Ross.
  • RossHRossH Posts: 5,573
    edited 2009-10-14 12:49
    All,

    PATCH TO FIX 'fseek'
    ==============

    Attached to the first email in this thread is a small patch to fix a bug in the fseek library function (the SEEK_CUR and SEEK_END options to fseek were not implemented). It is a minor change which does not require Catalina to be rebuilt. Just unzip the attached file in your Catalina main directory, over the top of the Catalina 2.1 distribution (source or binary, Windows or Linux).

    Ross.
  • RossHRossH Posts: 5,573
    edited 2009-10-14 12:59
    Another Catalina Curiosity ... Dumbo BASIC

    This is a beta release of Dumbo BASIC - Dumbo BASIC will eventually be a substantially complete GWBASIC clone for the Propeller. Dumbo BASIC compiles under Catalina using the large memory model (it requires 512KB XMM RAM). While not complete, this release of Dumbo BASIC (0.3) can already be used to execute complete BASIC programs, such as Star Trek and the classic ELIZA psychoanalyst program - albeit slowly (e.g. it can take 20-30 seconds for Eliza to respond to each line of input).

    Dumbo BASIC is based on 'Mini Basic' by Malcolm McLean, but has been heavily modified to add many common BASIC statements that Mini Basic lacks, new types, and also some tweaks to support GWBASIC style syntax.

    Dumbo BASIC currently executes BASIC programs at about 100 lines per second on the Propeller - about 10 times faster than Bywater BASIC, but still not very fast. However, I expect some simple optimization to improve the speed by several hundred percent, making it useful enough to run many old BASIC programs.

    The ELIZA basic program and several versions of Star Trek are included, along with Dumbo BASIC binaries for the TriBladeProp and the Hydra. There is also a DOS executable (the DOS version requires MinGW and GCC to compile).

    NEW VERSION - 0.3

    Version 0.3 of Dumbo BASIC adds the following:
    • includes several different BASIC versions of Star Trek.
    • many more GWBASIC language features added - see the README.txt document for details.
    • many more bugs fixed - and probably even more introduced!
    Enjoy!

    Edit: With the release of Catalina 2.3, all the Catalina Curiosities have been collected into a single zip file - see the second post in this thread.
  • heaterheater Posts: 3,370
    edited 2009-10-14 13:33
    Yawn, bin there done that. We have had ELIZA running on the Prop for a long time now. Under MBASIC under CP/M under ZiCog. I didn't like to mention it for fear of being accused of baiting Dr Jim. Now you can take that flak :)

    All cred for yet another Prop language. This one looks like it could shape up to be useful.

    By the way I was contemplating the idea of creating a PASM version of the Pascal interpreter PINT. Now that you have a Pascal compiler might be nice to turbo the interpreter for it.
