Big Spin - is it still a pipedream? - Page 7 — Parallax Forums

Big Spin - is it still a pipedream?


Comments

  • RossH Posts: 5,462
    edited 2011-02-20 03:44
    Hi Jazzed,
    jazzed wrote: »
    I've been scratching my head over what part of the cache interface would be big-endian specific.
    Could you please snip an example here and explain what you mean?

    I mean this code ...
    mov     fn, write
            call    #BSTART
    pw      rdlong  data, ptr
            mov     bits, #32
            call    #send
    
    ... and the corresponding read code. They both set the SPI RAM address (in BSTART), then transmit (or retrieve) the long to be stored as 32 successive bits starting with the most significant bit first. This results in the most-significant byte being stored in the lowest address (which auto increments because the SPI RAM has been set up in sequential mode).

    This is big-endian storage - i.e. the most significant byte is stored first (i.e. at the lowest address). Nothing wrong with this - provided it is always read back the same way. But the Prop is a little-endian architecture, with the least significant byte stored first.

    This won't affect you if you only ever access the SPI RAM as reading/writing longs via the cache. But it broke my first attempt to load a program because my loaders do not use the cache - and naturally they store programs in SPI RAM the same way it would be stored in Hub RAM. Then they start the cache. Of course, I could change all my loaders - but in my view, it is better to always store programs the same way so that the use of the cache is optional.
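    A minimal C sketch of the two layouts being described (the function names are illustrative, not taken from the actual driver):

    ```c
    #include <stdint.h>
    #include <assert.h>

    /* Big-endian storage: most significant byte at the lowest address,
       as the MSB-first SPI shift loop produces. */
    static void store_big_endian(uint8_t *mem, uint32_t value) {
        mem[0] = (value >> 24) & 0xFF;  /* MSB lands at the lowest address */
        mem[1] = (value >> 16) & 0xFF;
        mem[2] = (value >> 8) & 0xFF;
        mem[3] = value & 0xFF;
    }

    /* Little-endian storage: least significant byte at the lowest address,
       matching how the Prop lays out longs in Hub RAM. */
    static void store_little_endian(uint8_t *mem, uint32_t value) {
        mem[0] = value & 0xFF;          /* LSB lands at the lowest address */
        mem[1] = (value >> 8) & 0xFF;
        mem[2] = (value >> 16) & 0xFF;
        mem[3] = (value >> 24) & 0xFF;
    }
    ```

    Writing 0x12345678 with the first routine puts 0x12 at the lowest address (what the MSB-first SPI loop does); the second puts 0x78 there (what a loader copying Hub RAM byte-for-byte produces) - hence the mismatch.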

    Ross.
  • jazzed Posts: 11,803
    edited 2011-02-20 07:58
    Well that explains all my head scratching. I use the cache driver for BigSpin, so I never noticed. I've never looked at the internals of the C3 driver for more than a few seconds. Just keep a little endian version of C3 for Catalina.
  • RossH Posts: 5,462
    edited 2011-02-20 12:46
    jazzed wrote: »
    Well that explains all my head scratching. I use the cache driver for BigSpin, so I never noticed. I've never looked at the internals of the C3 driver for more than a few seconds. Just keep a little endian version of C3 for Catalina.
    Yes, I'll probably end up with my own version anyway for other reasons. Just keep it in mind, since it will cause problems if anyone uses the cache driver for purposes that need to interact with other Prop languages (e.g. to read/write data from the Flash RAM from both Spin and BigSpin).

    Ross.
  • jazzed Posts: 11,803
    edited 2011-02-20 12:55
    RossH wrote: »
    Yes, I'll probably end up with my own version anyway for other reasons. Just keep it in mind, since it will cause problems if anyone uses the cache driver for purposes that need to interact with other Prop languages (e.g. to read/write data from the Flash RAM from both Spin and BigSpin).

    Ross.
    Sure. I'll chat with David about the possibility of doing a little endian version.
  • RossH Posts: 5,462
    edited 2011-02-22 02:24
    All,

    I just got the DracBlade version of the caching driver working with Catalina. The results are quite interesting:
    Hardware        |  Language   |  FIBO(20) time  |  FIBO 0 to 26
    ----------------+-------------+-----------------+--------------
    C3 (LMM)        |  Catalina C |  306ms          |  11s
    C3 (Hub)        |  SPIN       |  547ms          |  30s
    DB XMM cached   |  Catalina C |  1243ms         |  58s
    DB XMM uncached |  Catalina C |  2659ms         |  2m6s
    ----------------+-------------+-----------------+--------------
    
    The caching driver speeds up execution from external RAM by a factor of two - so Catalina can now execute programs hundreds of kilobytes in size on the DracBlade at speeds only around two times slower than SPIN executing from Hub RAM - and this is on a platform with fairly slow parallel RAM (sorry, Dr_A!).

    I think that with a bit more optimization, and on a platform with fast parallel RAM (perhaps using Jazzed's SDRAM card), it may eventually be possible to have C executing from external RAM at speeds quite comparable to SPIN executing from Hub RAM.
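    For reference, the FIBO figures in the table are presumably from the classic doubly recursive Fibonacci, which in C boils down to something like:

    ```c
    #include <assert.h>

    /* The usual doubly recursive formulation - a sketch of what a FIBO(n)
       benchmark typically times, not the exact Catalina/BigSpin demo code. */
    unsigned fibo(unsigned n) {
        if (n < 2)
            return n;
        return fibo(n - 1) + fibo(n - 2);
    }
    ```

    FIBO(20) makes over 20,000 recursive calls, so it mostly measures call/return overhead - which is exactly the kind of code-fetch traffic a cache helps with.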

    Thanks are due to Jazzed and David for a nice piece of work on the caching driver (and also to Bill Henning who did the original work on VMCOG, and to Dr_Acula who makes the DracBlade!)

    Ross.
  • Cluso99 Posts: 18,069
    edited 2011-02-22 02:43
    I am curious, Ross. Have you tried these on the TriBlade?
  • RossH Posts: 5,462
    edited 2011-02-22 03:05
    Cluso99 wrote: »
    I am curious Ross. Have you tried these on the TriBlade?

    No - not yet. A version of the caching driver was already written for both the C3 and the DracBlade so I didn't actually need to do very much to port it to the DracBlade once I had it working on the C3.

    I thought I might try your RamBlade next - I think that's the fastest XMM platform I have.

    The upside of the caching driver is execution speed, but the downside is that it consumes a lot of Hub RAM, and also an additional cog. So it won't suit all applications - but it is a useful option to have available. Eventually, I will add support for all the XMM platforms currently supported by Catalina.

    Ross.
  • Heater. Posts: 21,230
    edited 2011-02-22 03:14
    And thanks also to me. Well a little anyway:)

    As far as I remember, the first code executed from external RAM on a Prop was achieved by using the Z80 emulator on a TriBlade, which then provided the motivation for DracBlades etc. (Of course I may be wrong; anyone have any earlier claims?)

    No, I'm not really claiming any credit for today's achievements; it's just that it has been interesting to watch the development of ext RAM solutions since those early "CP/M on a Prop" days. Now we have Big Spin and "big C". Amazing.
  • RossH Posts: 5,462
    edited 2011-02-22 03:23
    Heater. wrote: »
    And thanks also to me. Well a little anyway:)

    As far as I remember, the first code executed from external RAM on a Prop was achieved by using the Z80 emulator on a TriBlade, which then provided the motivation for DracBlades etc. (Of course I may be wrong; anyone have any earlier claims?)
    Fair enough - some credit should also go to both you and Cluso!

    Ross.
  • Cluso99 Posts: 18,069
    edited 2011-02-22 03:56
    While I built the TriBlade (and the HexBlade) to add external memory, I was not really the first. I later discovered that (IIRC) Beau had proposed a circuit, and of course the Hydra was out (which I also knew nothing about).

    But the catalyst was Heater's ZiCog, which IIRC I discovered while doing the TriBlade. It was a sheer delight and a frenzy to get that CP/M emulation running on the prop. While not much is happening on that front because of Zog and other OSes, I think it really started the whole ball rolling with what could really be achieved on the prop. And it gets better every day!!! It is just like Apple really got put on the map with SuperCalc.

    We are still waiting for that proper elusive PropOS. So far, to me, Sphinx makes the best attempt at getting there. I just have not had the time to work on it much over the past year.

    RossH: I am really interested in the differences between the RamBlade and TriBlade. I have a RamBlade II coming, so I am obviously quite curious. It's an add-on component in other boards that are coming.
  • jazzed Posts: 11,803
    edited 2011-02-22 08:31
    RossH wrote: »
    The caching driver speeds up execution from external RAM by a factor of two - so Catalina can now execute programs hundreds of kilobytes in size on the DracBlade at speeds only around two times slower than SPIN executing from Hub RAM
    Excellent news Ross. Is your DracBlade running at 80MHz?

    I see I grossly underestimated the performance gain in my original prediction for DracBlade.
    jazzed wrote: »
    performance of applications on your board using a cache would most likely increase by more than 50%.

    Here's a more complete table of FIBO* measurements:
    Hardware        |  Language   |  RAM Model  | FIBO(20) time  |  FIBO 0 to 26
    ----------------+-------------+-------------+----------------+--------------
    C3 (LMM)        |  Catalina C |  All HUB    | 306ms          |  11s
    C3 (Hub)        |  SPIN       |  All HUB    | 547ms          |  30s
    SDRAM (Hub)     |  SPIN       |  All HUB    | 547ms          |  30s
    DB XMM cached   |  Catalina C |  S&D HUB    | 1243ms         |  58s
    C3 XMM cached   |  Catalina C |  S&D HUB    | 1468ms         |  1m10s
    DB XMM uncached |  Catalina C |  S&D HUB    | 2659ms         |  2m6s
    SDRAM cached    |  ZOG C      |  All XMEM   | 2773ms         |  2m18s
    uPropPC cached  |  BigSPIN    |  All XMEM   | 2834ms         |  2m17s
    SDRAM cached    |  BigSPIN    |  All XMEM   | 2858ms         |  2m19s
    C3 cached       |  BigSPIN    |  All XMEM   | 3601ms         |  2m53s
    C3 cached       |  ZOG C      |  All XMEM   | 3644ms         |  3m18s
    C3 XMM uncached |  Catalina C |  S&D HUB    | 7386ms         |  5m50s
    ----------------+-------------+-------------+----------------+--------------
    RAM Models:
    All HUB  = code,stack,data in HUB
    All XMEM = code,stack,data in XMEM
    C&D XMEM = code,data in XMEM
    S&D HUB  = code only XMEM (XMM for LMM type interpreters)
    
    *AGAIN, FIBO measurements are only good for relative value. The Heater FFT application, when ported, will provide more data points. Dave implemented Dhrystone tests in SPIN, but it will not be a 100% apples-to-apples benchmark comparison, although it would provide some relative value; waiting for posted code for tests.

    My next contributions in this arena have already been previewed as hardware SpinSocket-Flash and uPropPC (MicroPropPC), and I'm vigorously trying to finish verifying the uPropPC concept prototype today before re-spinning hardware this week.

    The value ZOG brought to all this has been somewhat understated. I won't go into the details of why, to keep flaming arrows out of the thread, but without ZOG these results would not have happened.


    It's great that the outcome of all our external memory experiments and competition has helped us realize greater things as a group than one or two individuals were able to achieve in isolation. It has come with some bruises and some questionable heritage attribution, but in the end (which surely has not been seen) we are all beneficiaries.
  • Dave Hein Posts: 6,347
    edited 2011-02-22 08:56
    jazzed wrote: »
    Dhrystone tests have not been implemented in SPIN.
    Steve,

    I have implemented the Dhrystone benchmark in Spin. Of course, the Dhrystone benchmark is a C benchmark, and should be used to compare the performance of C compilers and the target platforms. I used cspin to convert the Dhrystone program to Spin. The Spin program could be tweaked to produce better performance, but it would need to be understood that there was hand tuning involved in that case.

    Dave
  • jazzed Posts: 11,803
    edited 2011-02-22 09:24
    @Dave I was wondering about that (forgot to add AFAIK). Can you post your Spin implementation? If nothing else I can verify that it works on BigSpin. Thanks.
  • jazzed Posts: 11,803
    edited 2011-02-22 11:27
    @Ross, do you have numbers for C3 running code from SPI-Flash?
    I presume your previous results were for SPI-RAM only.
  • Dave Hein Posts: 6,347
    edited 2011-02-22 12:12
    Steve,

    Here's the Spin version of the Dhrystone benchmark. I have it set to use clib_conio to run under SpinSim. You can change that to use clib to run on the hardware. The LOOP parameter is set for 20,000 loops. That's located about halfway down the file if you want to change it.

    Dave
  • RossH Posts: 5,462
    edited 2011-02-22 14:15
    @All,

    What we have here is the fruits of both the collaborative effort and the good-natured competition that characterizes these forums.

    @Cluso,

    I will post both TriBlade and RamBlade figures when I have them. Shouldn't take too long since I already have working code to access the XMM RAM on these boards - I just need to plug this code into the caching XMM driver.

    @Jazzed,

    Yes, my DracBlade runs at 80MHz. The first Flash-based model I am implementing on the C3 is to have the stack in Hub RAM, all data segments in SPI and all code in Flash. If Flash reads are the same speed as SPI reads then the execution time will be the same as the current speed. The main bit I don't have working yet is the loader necessary to organize all the memory segments correctly. As soon as I have that, I will release an update to Catalina.

    Ross.
  • jazzed Posts: 11,803
    edited 2011-02-22 16:27
    Guess this is more of a benchmark thread thing, but eventually some flavor of BigSpin will run this too.

    I get 476 DMIPS on an 80MHz Propeller running Dave's Spin Dhrystone(1.1) program.
    This seems slightly better than an original IBM PC/XT running at 4.77MHz with any OS + compiler.

    I get 625 DMIPS on a 104MHz Propeller.
  • RossH Posts: 5,462
    edited 2011-02-22 16:51
    jazzed wrote: »
    Guess this is more of a benchmark thread thing, but eventually some flavor of BigSpin will run this too.

    I get 476 DMIPS on an 80MHz Propeller running Dave's Spin Dhrystone(1.1) program.
    This seems slightly better than an original IBM PC/XT running at 4.77MHz with any OS + compiler.

    I get 625 DMIPS on a 104MHz Propeller.

    I think you mean "Dhrystones/Second" (D/S), not DMIPS.

    DMIPS = D/S/1757 (1757 = D/S result of a VAX 11/780).

    So 625 DMIPS would be 625 times faster than a VAX 11/780, or faster than a 300MHz Pentium II.
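    Ross's conversion is a one-liner (a sketch of the arithmetic, using his 1757 figure):

    ```c
    #include <assert.h>

    /* DMIPS normalizes Dhrystones/second against the VAX 11/780,
       which scored 1757 Dhrystones/second and is defined as 1 DMIPS. */
    static double dmips(double dhrystones_per_second) {
        return dhrystones_per_second / 1757.0;
    }
    ```

    By that measure, 476 Dhrystones/second is roughly 0.27 DMIPS and 625 is roughly 0.36 DMIPS.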

    Ross.
  • jazzed Posts: 11,803
    edited 2011-02-22 19:21
    RossH wrote: »
    I think you mean "Dhrystones/Second" (D/S), not DMIPS.

    More coffee LOL. Thanks for straightening me out.
  • Dr_Acula Posts: 5,484
    edited 2011-02-22 20:46
    This is all very exciting!

    Nothing wrong with some healthy competition. Indeed, the harsher the comments/hotter the flames, the more productive the output *grin*.

    I am still a bit confused about what is being compared with what.

    Put Catalina in LMM on a C3 and that is not using the external memory, right? Neither is LMM on a Dracblade using external memory.

    Take external memory models only. Those either need a large (>32k) program for testing, or specifically need to be coded such that they really are running from external memory. Caching clearly is faster than non caching.

    But take a cached model, >32k program and run it on serial ram (C3), parallel ram with latches (dracblade) or parallel ram with direct access (Clusoblade). I would have thought the speed would increase as you move through those options.

    Do we have a test showing this?

    And maybe I'm confused here, but do we have a Big Spin program that is >32k yet?
  • RossH Posts: 5,462
    edited 2011-02-22 21:11
    Dr_Acula wrote: »
    Put Catalina in LMM on a C3 and that is not using the external memory, right? Neither is LMM on a Dracblade using external memory.
    Correct - we should probably take the "C3" off the LMM and SPIN entries, since the same times would be achieved on any Propeller platform.
    Dr_Acula wrote: »
    Take external memory models only. Those either need a large (>32k) program for testing, or specifically need to be coded such that they really are running from external memory.
    Generally it's the latter - i.e. the programs could be run from Hub RAM, but are being explicitly compiled to run from external RAM instead.
    Dr_Acula wrote: »
    But take a cached model, >32k program and run it on serial ram (C3), parallel ram with latches (dracblade) or parallel ram with direct access (Clusoblade). I would have thought the speed would increase as you move though those options.

    Do we have a test showing this?
    Soon we will have - Catalina will be directly comparable across all these platforms - but BigSpin may take a while longer.
    Dr_Acula wrote: »
    And maybe I'm confused here, but do we have a Big Spin program that is >32k yet?

    No - and where would we find anyone masochistic enough to want to write one? :lol:

    Ross.
  • jazzed Posts: 11,803
    edited 2011-02-22 21:22
    Dr_Acula wrote: »
    Put Catalina in LMM on a C3 and that is not using the external memory, right? Neither is LMM on a Dracblade using external memory.
    The HUB only performance numbers show baselines. It's a scientific method thingy.
    Dr_Acula wrote: »
    But take a cached model, >32k program and run it on serial ram (C3), parallel ram with latches (dracblade) or parallel ram with direct access (Clusoblade). I would have thought the speed would increase as you move though those options.
    The speed does increase. There is more work to do to provide better comparisons.
    Dr_Acula wrote: »
    And maybe I'm confused here, but do we have a Big Spin program that is >32k yet?
    Not just yet. I'm under pressure to finish some other things first. BigSpin is important so that *normal* Propeller users can write and run bigger programs. Not everyone wants to use C.
  • Dr_Acula Posts: 5,484
    edited 2011-02-22 21:55
    "Not just yet".

    Ah, that is a very enticing answer.

    Ross said "No - and where would we find anyone masochistic enough to want to write one? "

    All my projects seem to outgrow the propeller memory!

    Regarding different memory models, I have an IDE where you can compile to all sorts of different memory models. The aim is that you start with your standard model that fits in a propeller, but when it outgrows the memory, you can compile to a bigger model just by clicking a different button. This works for C and it works for BCX basic, and I'm really looking forward to adding Big Spin.

    One of the projects I'm working on behind the scenes is trying to split up pasm and spin code. The aim is to have pasm code in external memory (or loaded from sd card), and hence save 2k of hub ram. With the memory saved, it ought to be possible to use that hub for more useful things, like a stack, or a cache, or a video buffer.

    I think it will be a matter of taking a Big Spin program, precompiling but stripping out all the pasm code, compiling that separately using homespun/BST, then linking it back in with a pointer to a hub array that you pass through with "par".

    Maybe there is another way to do this? Ultimately, what would be very useful is to be able to add obex code without having to think too much about how much memory is being used.

    I'm testing this out in C and Basic at the moment - I think you might need some sort of compiler directive $external or whatever, to put before some pasm code and so determine where that pasm code gets stored.

    Ditto determining where arrays and variables get stored. Catalina does have a default mode where global arrays are in external memory and local arrays are in hub memory, and you can also trick the compiler to put global arrays in hub memory by putting the array in the "main" function.

    I'm trying to think of a way that could be done in Big Spin. Possibly just use the same rules?

    As an aside, my IDE is in VB.NET and I'm in discussions on the jabaco forum looking at trying to get a serial port to work on jabaco, as this might mean I could port the IDE into jabaco, which compiles to java and hence will run on Linux as well. Jabaco is VB6 which is similar to the way I write VB.NET. I'd like to be able to contribute something here, maybe in the way of an open source IDE/compiler that can handle Spin/C/Basic/Big Spin all in the one program.
  • David Betz Posts: 14,516
    edited 2011-02-23 08:13
    RossH wrote: »
    Hi Jazzed,

    I mean this code ...
    mov     fn, write
            call    #BSTART
    pw      rdlong  data, ptr
            mov     bits, #32
            call    #send
    
    ... and the corresponding read code. They both set the SPI RAM address (in BSTART), then transmit (or retrieve) the long to be stored as 32 successive bits starting with the most significant bit first. This results in the most-significant byte being stored in the lowest address (which auto increments because the SPI RAM has been set up in sequential mode).

    This is big-endian storage - i.e. the most significant byte is stored first (i.e. at the lowest address). Nothing wrong with this - provided it is always read back the same way. But the Prop is a little-endian architecture, with the least significant byte stored first.

    This won't affect you if you only ever access the SPI RAM as reading/writing longs via the cache. But it broke my first attempt to load a program because my loaders do not use the cache - and naturally they store programs in SPI RAM the same way it would be stored in Hub RAM. Then they start the cache. Of course, I could change all my loaders - but in my view, it is better to always store programs the same way so that the use of the cache is optional.

    Ross.

    As you suggest, this is not what makes ZOG big-endian. That is all of the XORs with %11 and %10. My cache code only reads whole pages at a time so the order that the bits are written to SDRAM or flash is not relevant. It is if you bypass the cache though. I could certainly change my code to write the bytes in little-endian order if that would be helpful.
  • RossH Posts: 5,462
    edited 2011-02-23 12:43
    David Betz wrote: »
    It is if you bypass the cache though. I could certainly change my code to write the bytes in little-endian order if that would be helpful.

    Hi David,

    It would be helpful - I have already changed the version I use. Catalina does not always access the serial RAM or FLASH via the cache. During the initial program load, which is essentially just a long sequence of contiguous writes that you have no intention of re-reading, using the cache slows things down.

    Ross.
  • David Betz Posts: 14,516
    edited 2011-02-23 12:54
    Are your modifications backward compatible? If so, could you send me your modified code and I'll adopt it as the standard version?
  • RossH Posts: 5,462
    edited 2011-02-23 14:42
    Hi David,

    I made some trivial changes in the BREAD and BWRITE routines just so I could continue testing ...
    BREAD
            test    vmaddr, bit20 wz
      if_nz jmp     #FLASH_READ
    
            mov     fn, read
            call    #BSTART
    
    BREAD_DATA
    read0
            mov     count2,#4
            mov     endian,#0
            
    read1   call    #spiRecvByte
            or      endian,data
            ror     endian,#8
            djnz    count2,#read1
            
            wrlong  endian, ptr
            add     ptr, #4
            djnz    count, #read0
    
            call    #deselect
    BREAD_RET
            ret
    
    
    ... and ...
    BWRITE
            test    vmaddr, bit20 wz
  if_nz jmp     #BWRITE_RET
    
            mov     fn, write
            call    #BSTART
    
    pw      rdlong  endian, ptr
            mov     count2,#4
            
    pb      mov     data,endian
            call    #spiSendByte
            ror     endian,#8
            djnz    count2, #pb
            
            add     ptr, #4
            djnz    count, #pw
    
            call    #deselect
    BWRITE_RET
            ret
    
    fn      long    0
    
    endian  long    0
    count2  long    0
    
    
    However, I think these changes may cause your original version to exceed the 496 long limit (in my version I have disabled the SD Card stuff, so I don't have this problem).
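    For anyone following along, here is a little C model of what the BREAD_DATA loop does (assuming spiRecvByte leaves each byte in the low 8 bits of data, as the OR suggests):

    ```c
    #include <stdint.h>
    #include <assert.h>

    /* 32-bit rotate right, mirroring the PASM ror instruction
       (valid here for n in 1..31). */
    static uint32_t ror32(uint32_t v, unsigned n) {
        return (v >> n) | (v << (32 - n));
    }

    /* Model of the read loop: each byte from SPI is OR-ed into the low
       8 bits, then the long is rotated right by 8, four times over.
       The first byte received ends up in the least significant position,
       i.e. the long is assembled little-endian, matching Hub RAM layout. */
    static uint32_t assemble_long(const uint8_t bytes[4]) {
        uint32_t endian = 0;
        for (int i = 0; i < 4; i++) {
            endian |= bytes[i];       /* spiRecvByte result into the low byte */
            endian = ror32(endian, 8);
        }
        return endian;
    }
    ```

    Feeding it the byte sequence 0x78, 0x56, 0x34, 0x12 yields the long 0x12345678: the first byte off the wire lands in the least significant byte, which is the little-endian ordering the Prop expects.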

    Ross.
  • David Betz Posts: 14,516
    edited 2011-02-23 14:52
    Thanks! I'll try to merge it into my version. Have you released a version that uses both SPI SRAM and SPI flash on the C3?
  • RossH Posts: 5,462
    edited 2011-02-23 15:01
    Hi David,

    Not yet - I'm still struggling with the load process. I have to reorganize the various program segments when I store them in Flash, and then re-organize them just before I actually execute the program (e.g. to move the data segments from Flash into RAM). It's a pain in the proverbial!

    I hope I'll get time to finish this off over the coming weekend.

    Ross.
  • David Betz Posts: 14,516
    edited 2011-02-23 16:42
    Have you ever considered generating binaries that would work with the GNU linker? It handles all of that for me in ZOG. You can place .text and .data anywhere you want and you can also create an image of .data in the .text section and copy it into place at startup. That allows me to put the initial values of globals in flash and then move them to their place in SRAM just before calling main().
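    The scheme described here can be modelled in a few lines of C (the arrays below stand in for the linker-placed sections; real startup code would use linker-script symbols, whose names vary between toolchains):

    ```c
    #include <stdint.h>
    #include <string.h>
    #include <assert.h>

    /* Toy model: the linker keeps the initial values of .data in flash
       (flash_image) and startup code copies them to their run-time SRAM
       home (sram_data) before main() runs. In a real build the two
       addresses come from linker symbols, not arrays like these. */
    static const uint8_t flash_image[] = { 1, 2, 3, 4 };   /* .data image in flash */
    static uint8_t sram_data[sizeof flash_image];          /* .data at run time */

    static void crt0_copy_data(void) {
        memcpy(sram_data, flash_image, sizeof flash_image);
    }
    ```

    With GNU ld this split is usually arranged by giving .data a load address in flash (the AT() directive) distinct from its run-time address in RAM, which matches what is described: globals initialized from flash, then moved into SRAM just before calling main().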