Big Spin - is it still a pipedream? - Page 7 — Parallax Forums

Big Spin - is it still a pipedream?


Comments

  • RossH Posts: 5,462
    edited 2011-02-20 03:44
    Hi Jazzed,
    jazzed wrote: »
    I've been scratching my head over what part of the cache interface would be big-endian specific.
    Could you please snip an example here and explain what you mean?

    I mean this code ...
    mov     fn, write
            call    #BSTART
    pw      rdlong  data, ptr
            mov     bits, #32
            call    #send
    
    ... and the corresponding read code. They both set the SPI RAM address (in BSTART), then transmit (or retrieve) the long to be stored as 32 successive bits starting with the most significant bit first. This results in the most-significant byte being stored in the lowest address (which auto increments because the SPI RAM has been set up in sequential mode).

    This is big-endian storage - i.e. the most significant byte is stored first (i.e. at the lowest address). Nothing wrong with this - provided it is always read back the same way. But the Prop is a little-endian architecture, with the least significant byte stored first.

    This won't affect you if you only ever access the SPI RAM as reading/writing longs via the cache. But it broke my first attempt to load a program because my loaders do not use the cache - and naturally they store programs in SPI RAM the same way it would be stored in Hub RAM. Then they start the cache. Of course, I could change all my loaders - but in my view, it is better to always store programs the same way so that the use of the cache is optional.
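    A minimal C sketch of the two layouts being described (the function names are illustrative, not taken from the actual driver):

    ```c
    #include <stdint.h>
    #include <assert.h>

    /* Big-endian storage: most significant byte at the lowest address,
       as the MSB-first SPI shift loop produces. */
    static void store_big_endian(uint8_t *mem, uint32_t value) {
        mem[0] = (value >> 24) & 0xFF;  /* MSB lands at the lowest address */
        mem[1] = (value >> 16) & 0xFF;
        mem[2] = (value >> 8) & 0xFF;
        mem[3] = value & 0xFF;
    }

    /* Little-endian storage: least significant byte at the lowest address,
       matching how the Prop lays out longs in Hub RAM. */
    static void store_little_endian(uint8_t *mem, uint32_t value) {
        mem[0] = value & 0xFF;          /* LSB lands at the lowest address */
        mem[1] = (value >> 8) & 0xFF;
        mem[2] = (value >> 16) & 0xFF;
        mem[3] = (value >> 24) & 0xFF;
    }
    ```

    Writing 0x12345678 with the first routine puts 0x12 at the lowest address (what the MSB-first SPI loop does); the second puts 0x78 there (what a loader copying Hub RAM byte-for-byte produces) - hence the mismatch.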

    Ross.
  • jazzed Posts: 11,803
    edited 2011-02-20 07:58
    Well that explains all my head scratching. I use the cache driver for BigSpin, so I never noticed. I've never looked at the internals of the C3 driver for more than a few seconds. Just keep a little endian version of C3 for Catalina.
  • RossH Posts: 5,462
    edited 2011-02-20 12:46
    jazzed wrote: »
    Well that explains all my head scratching. I use the cache driver for BigSpin, so I never noticed. I've never looked at the internals of the C3 driver for more than a few seconds. Just keep a little endian version of C3 for Catalina.
    Yes, I'll probably end up with my own version anyway for other reasons. Just keep it in mind, since it will cause problems if anyone uses the cache driver for purposes that need to interact with other Prop languages (e.g. to read/write data from the Flash RAM from both Spin and BigSpin).

    Ross.
  • jazzed Posts: 11,803
    edited 2011-02-20 12:55
    RossH wrote: »
    Yes, I'll probably end up with my own version anyway for other reasons. Just keep it in mind, since it will cause problems if anyone uses the cache driver for purposes that need to interact with other Prop languages (e.g. to read/write data from the Flash RAM from both Spin and BigSpin).

    Ross.
    Sure. I'll chat with David about the possibility of doing a little endian version.
  • RossH Posts: 5,462
    edited 2011-02-22 02:24
    All,

    I just got the DracBlade version of the caching driver working with Catalina. The results are quite interesting:
    Hardware        |  Language   |  FIBO(20) time  |  FIBO 0 to 26
    ----------------+-------------+-----------------+--------------
    C3 (LMM)        |  Catalina C |  306ms          |  11s
    C3 (Hub)        |  SPIN       |  547ms          |  30s
    DB XMM cached   |  Catalina C |  1243ms         |  58s
    DB XMM uncached |  Catalina C |  2659ms         |  2m6s
    ----------------+-------------+-----------------+--------------
    
    The caching driver speeds up execution from external RAM by a factor of two - so Catalina can now execute programs hundreds of kilobytes in size on the DracBlade at speeds only around two times slower than SPIN executing from Hub RAM - and this is on a platform with fairly slow parallel RAM (sorry, Dr_A!).

    I think that with a bit more optimization, and on a platform with fast parallel RAM (perhaps using Jazzed's SDRAM card), it may eventually be possible to have C executing from external RAM at speeds quite comparable to SPIN executing from Hub RAM.
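    For reference, the FIBO figures in the table are presumably from the classic doubly recursive Fibonacci, which in C boils down to something like:

    ```c
    #include <assert.h>

    /* The usual doubly recursive formulation - a sketch of what a FIBO(n)
       benchmark typically times, not the exact Catalina/BigSpin demo code. */
    unsigned fibo(unsigned n) {
        if (n < 2)
            return n;
        return fibo(n - 1) + fibo(n - 2);
    }
    ```

    FIBO(20) makes over 20,000 recursive calls, so it mostly measures call/return overhead - which is exactly the kind of code-fetch traffic a cache helps with.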

    Thanks are due to Jazzed and David for a nice piece of work on the caching driver (and also to Bill Henning who did the original work on VMCOG, and to Dr_Acula who makes the DracBlade!)

    Ross.
  • Cluso99 Posts: 18,069
    edited 2011-02-22 02:43
    I am curious, Ross. Have you tried these on the TriBlade?
  • RossH Posts: 5,462
    edited 2011-02-22 03:05
    Cluso99 wrote: »
    I am curious Ross. Have you tried these on the TriBlade?

    No - not yet. A version of the caching driver was already written for both the C3 and the DracBlade so I didn't actually need to do very much to port it to the DracBlade once I had it working on the C3.

    I thought I might try your RamBlade next - I think that's the fastest XMM platform I have.

    The upside of the caching driver is execution speed, but the downside is that it consumes a lot of Hub RAM, and also an additional cog. So it won't suit all applications - but it is a useful option to have available. Eventually, I will add support for all the XMM platforms currently supported by Catalina.

    Ross.
  • Heater. Posts: 21,230
    edited 2011-02-22 03:14
    And thanks also to me. Well a little anyway:)

    As far as I remember, the first code executed from external RAM on a Prop was achieved by using the Z80 emulator on a TriBlade, which then provided the motivation for DracBlades etc. (Of course I may be wrong; anyone have any earlier claims?)

    No, I'm not really claiming any credit for today's achievements; it's just that it has been interesting to watch the development of ext RAM solutions since those early "CP/M on a Prop" days. Now we have Big Spin and "big C". Amazing.
  • RossH Posts: 5,462
    edited 2011-02-22 03:23
    Heater. wrote: »
    And thanks also to me. Well a little anyway:)

    As far as I remember, the first code executed from external RAM on a Prop was achieved by using the Z80 emulator on a TriBlade, which then provided the motivation for DracBlades etc. (Of course I may be wrong; anyone have any earlier claims?)
    Fair enough - some credit should also go to both you and Cluso!

    Ross.
  • Cluso99 Posts: 18,069
    edited 2011-02-22 03:56
    While I built the TriBlade (and the HexBlade) to add external memory, I was not really the first. I later discovered that (IIRC) Beau had proposed a circuit, and of course the Hydra was out (which I also knew nothing about).

    But the catalyst was Heater's ZiCog, which IIRC I discovered while doing the TriBlade. It was a sheer delight and a frenzy to get that CP/M emulation running on the prop. While not much is happening on that front because of Zog and other OSes, I think it really started the whole ball rolling with what could really be achieved on the prop. And it gets better every day!!! It is just like Apple really got put on the map with SuperCalc.

    We are still waiting for that proper elusive PropOS. So far, to me, Sphinx makes the best attempt at getting there. I just have not had the time to work on it much over the past year.

    RossH: I am really interested in the differences between the RamBlade and TriBlade. I have a RamBlade II coming, so I am obviously quite curious. It's an add-on component in other boards that are coming.
  • jazzed Posts: 11,803
    edited 2011-02-22 08:31
    RossH wrote: »
    The caching driver speeds up execution from external RAM by a factor of two - so Catalina can now execute programs hundreds of kilobytes in size on the DracBlade at speeds only around two times slower than SPIN executing from Hub RAM
    Excellent news Ross. Is your DracBlade running at 80MHz?

    I see I grossly underestimated the performance gain in my original prediction for DracBlade.
    jazzed wrote: »
    performance of applications on your board using a cache would most likely increase by more than 50%.

    Here's a more complete table of FIBO* measurements:
    Hardware        |  Language   |  RAM Model  | FIBO(20) time  |  FIBO 0 to 26
    ----------------+-------------+-------------+----------------+--------------
    C3 (LMM)        |  Catalina C |  All HUB    | 306ms          |  11s
    C3 (Hub)        |  SPIN       |  All HUB    | 547ms          |  30s
    SDRAM (Hub)     |  SPIN       |  All HUB    | 547ms          |  30s
    DB XMM cached   |  Catalina C |  S&D HUB    | 1243ms         |  58s
    C3 XMM cached   |  Catalina C |  S&D HUB    | 1468ms         |  1m10s
    DB XMM uncached |  Catalina C |  S&D HUB    | 2659ms         |  2m6s
    SDRAM cached    |  ZOG C      |  All XMEM   | 2773ms         |  2m18s
    uPropPC cached  |  BigSPIN    |  All XMEM   | 2834ms         |  2m17s
    SDRAM cached    |  BigSPIN    |  All XMEM   | 2858ms         |  2m19s
    C3 cached       |  BigSPIN    |  All XMEM   | 3601ms         |  2m53s
    C3 cached       |  ZOG C      |  All XMEM   | 3644ms         |  3m18s
    C3 XMM uncached |  Catalina C |  S&D HUB    | 7386ms         |  5m50s
    ----------------+-------------+-------------+----------------+--------------
    RAM Models:
    All HUB  = code,stack,data in HUB
    All XMEM = code,stack,data in XMEM
    C&D XMEM = code,data in XMEM
    S&D HUB  = code only XMEM (XMM for LMM type interpreters)
    
    *AGAIN, FIBO measurements are only good for relative value. The Heater FFT application, when ported, will provide more data points. Dave implemented Dhrystone tests in SPIN, but it will not be a 100% apples-to-apples benchmark comparison, although it would provide some relative value; waiting for posted code for tests.

    My next contributions in this arena have already been previewed as hardware SpinSocket-Flash and uPropPC (MicroPropPC), and I'm vigorously trying to finish verifying the uPropPC concept prototype today before re-spinning hardware this week.

    The value ZOG brought to all this has been somewhat understated. I won't go into the details of why, to keep flaming arrows out of the thread, but without ZOG these results would not have happened.


    It's great that the outcome of all our external memory experiments and competition has helped us realize greater things as a group than one or two individuals were able to achieve in isolation. It has come with some bruises and some questionable heritage attribution, but in the end (which surely has not been seen) we are all beneficiaries.
  • Dave Hein Posts: 6,347
    edited 2011-02-22 08:56
    jazzed wrote: »
    Dhrystone tests have not been implemented in SPIN.
    Steve,

    I have implemented the Dhrystone benchmark in Spin. Of course, the Dhrystone benchmark is a C benchmark, and should be used to compare the performance of C compilers and the target platforms. I used cspin to convert the Dhrystone program to Spin. The Spin program could be tweaked to produce better performance, but it would need to be understood that there was hand tuning involved in that case.

    Dave
  • jazzed Posts: 11,803
    edited 2011-02-22 09:24
    @Dave I was wondering about that (forgot to add AFAIK). Can you post your Spin implementation? If nothing else I can verify that it works on BigSpin. Thanks.
  • jazzed Posts: 11,803
    edited 2011-02-22 11:27
    @Ross, do you have numbers for C3 running code from SPI-Flash?
    I presume your previous results were for SPI-RAM only.
  • Dave Hein Posts: 6,347
    edited 2011-02-22 12:12
    Steve,

    Here's the Spin version of the Dhrystone benchmark. I have it set to use clib_conio to run under SpinSim. You can change that to use clib to run on the hardware. The LOOP parameter is set for 20,000 loops. That's located about halfway down the file if you want to change it.

    Dave
  • RossH Posts: 5,462
    edited 2011-02-22 14:15
    @All,

    What we have here is the fruits of both the collaborative effort and the good-natured competition that characterizes these forums.

    @Cluso,

    I will post both TriBlade and RamBlade figures when I have them. Shouldn't take too long since I already have working code to access the XMM RAM on these boards - I just need to plug this code into the caching XMM driver.

    @Jazzed,

    Yes, my DracBlade runs at 80MHz. The first Flash-based model I am implementing on the C3 is to have the stack in Hub RAM, all data segments in SPI and all code in Flash. If Flash reads are the same speed as SPI reads then the execution time will be the same as the current speed. The main bit I don't have working yet is the loader necessary to organize all the memory segments correctly. As soon as I have that, I will release an update to Catalina.

    Ross.
  • jazzed Posts: 11,803
    edited 2011-02-22 16:27
    Guess this is more of a benchmark thread thing, but eventually some flavor of BigSpin will run this too.

    I get 476 DMIPS on an 80MHz Propeller running Dave's Spin Dhrystone(1.1) program.
    This seems slightly better than an original IBM PC/XT running at 4.77MHz with any OS + compiler.

    I get 625 DMIPS on a 104MHz Propeller.
  • RossH Posts: 5,462
    edited 2011-02-22 16:51
    jazzed wrote: »
    Guess this is more of a benchmark thread thing, but eventually some flavor of BigSpin will run this too.

    I get 476 DMIPS on an 80MHz Propeller running Dave's Spin Dhrystone(1.1) program.
    This seems slightly better than an original IBM PC/XT running at 4.77MHz with any OS + compiler.

    I get 625 DMIPS on a 104MHz Propeller.

    I think you mean "Dhrystones/Second" (D/S), not DMIPS.

    DMIPS = D/S/1757 (1757 = D/S result of a VAX 11/780).

    So 625 DMIPS would be 625 times faster than a VAX 11/780, or faster than a 300MHz Pentium II.
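    Ross's conversion is a one-liner (a sketch of the arithmetic, using his 1757 figure):

    ```c
    #include <assert.h>

    /* DMIPS normalizes Dhrystones/second against the VAX 11/780,
       which scored 1757 Dhrystones/second and is defined as 1 DMIPS. */
    static double dmips(double dhrystones_per_second) {
        return dhrystones_per_second / 1757.0;
    }
    ```

    By that measure, 476 Dhrystones/second is roughly 0.27 DMIPS and 625 is roughly 0.36 DMIPS.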

    Ross.
  • jazzed Posts: 11,803
    edited 2011-02-22 19:21
    RossH wrote: »
    I think you mean "Dhrystones/Second" (D/S), not DMIPS.

    More coffee LOL. Thanks for straightening me out.
  • Dr_Acula Posts: 5,484
    edited 2011-02-22 20:46
    This is all very exciting!

    Nothing wrong with some healthy competition. Indeed, the harsher the comments/hotter the flames, the more productive the output *grin*.

    I am still a bit confused about what is being compared with what.

    Put Catalina in LMM on a C3 and that is not using the external memory, right? Neither is LMM on a Dracblade using external memory.

    Take external memory models only. Those either need a large (>32k) program for testing, or specifically need to be coded such that they really are running from external memory. Caching clearly is faster than non caching.

    But take a cached model, >32k program and run it on serial ram (C3), parallel ram with latches (dracblade) or parallel ram with direct access (Clusoblade). I would have thought the speed would increase as you move through those options.

    Do we have a test showing this?

    And maybe I'm confused here, but do we have a Big Spin program that is >32k yet?
  • RossH Posts: 5,462
    edited 2011-02-22 21:11
    Dr_Acula wrote: »
    Put Catalina in LMM on a C3 and that is not using the external memory, right? Neither is LMM on a Dracblade using external memory.
    Correct - we should probably take the "C3" off the LMM and SPIN entries, since the same times would be achieved on any Propeller platform.
    Dr_Acula wrote: »
    Take external memory models only. Those either need a large (>32k) program for testing, or specifically need to be coded such that they really are running from external memory.
    Generally it's the latter - i.e. the programs could be run from Hub RAM, but are being explicitly compiled to run from external RAM instead.
    Dr_Acula wrote: »
    But take a cached model, >32k program and run it on serial ram (C3), parallel ram with latches (dracblade) or parallel ram with direct access (Clusoblade). I would have thought the speed would increase as you move though those options.

    Do we have a test showing this?
    Soon we will have - Catalina will be directly comparable across all these platforms - but BigSpin may take a while longer.
    Dr_Acula wrote: »
    And maybe I'm confused here, but do we have a Big Spin program that is >32k yet?

    No - and where would we find anyone masochistic enough to want to write one? :lol:

    Ross.
  • jazzed Posts: 11,803
    edited 2011-02-22 21:22
    Dr_Acula wrote: »
    Put Catalina in LMM on a C3 and that is not using the external memory, right? Neither is LMM on a Dracblade using external memory.
    The HUB only performance numbers show baselines. It's a scientific method thingy.
    Dr_Acula wrote: »
    But take a cached model, >32k program and run it on serial ram (C3), parallel ram with latches (dracblade) or parallel ram with direct access (Clusoblade). I would have thought the speed would increase as you move though those options.
    The speed does increase. There is more work to do to provide better comparisons.
    Dr_Acula wrote: »
    And maybe I'm confused here, but do we have a Big Spin program that is >32k yet?
    Not just yet. I'm under pressure to finish some other things first. BigSpin is important so that *normal* Propeller users can write and run bigger programs. Not everyone wants to use C.
  • Dr_Acula Posts: 5,484
    edited 2011-02-22 21:55
    "Not just yet".

    Ah, that is a very enticing answer.

    Ross said "No - and where would we find anyone masochistic enough to want to write one? "

    All my projects seem to outgrow the propeller memory!

    Regarding different memory models, I have an IDE where you can compile to all sorts of different memory models. The aim is that you start with your standard model that fits in a propeller, but when it outgrows the memory, you can compile to a bigger model just by clicking a different button. This works for C and it works for BCX basic, and I'm really looking forward to adding Big Spin.

    One of the projects I'm working on behind the scenes is trying to split up pasm and spin code. The aim is to have pasm code in external memory (or loaded from sd card), and hence save 2k of hub ram. With the memory saved, it ought to be possible to use that hub for more useful things, like a stack, or a cache, or a video buffer.

    I think it will be a matter of taking a Big Spin program, precompiling but stripping out all the pasm code, compiling that separately using homespun/BST, then linking it back in with a pointer to a hub array that you pass through with "par".

    Maybe there is another way to do this? Ultimately, what would be very useful is to be able to add obex code without having to think too much about how much memory is being used.

    I'm testing this out in C and Basic at the moment - I think you might need some sort of compiler directive $external or whatever, to put before some pasm code and so determine where that pasm code gets stored.

    Ditto determining where arrays and variables get stored. Catalina does have a default mode where global arrays are in external memory and local arrays are in hub memory, and you can also trick the compiler to put global arrays in hub memory by putting the array in the "main" function.

    I'm trying to think of a way that could be done in Big Spin. Possibly just use the same rules?

    As an aside, my IDE is in VB.NET and I'm in discussions on the jabaco forum looking at trying to get a serial port to work on jabaco, as this might mean I could port the IDE into jabaco, which compiles to java and hence will run on Linux as well. Jabaco is VB6 which is similar to the way I write VB.NET. I'd like to be able to contribute something here, maybe in the way of an open source IDE/compiler that can handle Spin/C/Basic/Big Spin all in the one program.
  • David Betz Posts: 14,516
    edited 2011-02-23 08:13
    RossH wrote: »
    Hi Jazzed,

    I mean this code ...
    mov     fn, write
            call    #BSTART
    pw      rdlong  data, ptr
            mov     bits, #32
            call    #send
    
    ... and the corresponding read code. They both set the SPI RAM address (in BSTART), then transmit (or retrieve) the long to be stored as 32 successive bits starting with the most significant bit first. This results in the most-significant byte being stored in the lowest address (which auto increments because the SPI RAM has been set up in sequential mode).

    This is big-endian storage - i.e. the most significant byte is stored first (i.e. at the lowest address). Nothing wrong with this - provided it is always read back the same way. But the Prop is a little-endian architecture, with the least significant byte stored first.

    This won't affect you if you only ever access the SPI RAM as reading/writing longs via the cache. But it broke my first attempt to load a program because my loaders do not use the cache - and naturally they store programs in SPI RAM the same way it would be stored in Hub RAM. Then they start the cache. Of course, I could change all my loaders - but in my view, it is better to always store programs the same way so that the use of the cache is optional.

    Ross.

    As you suggest, this is not what makes ZOG big-endian. That is all of the XORs with %11 and %10. My cache code only reads whole pages at a time so the order that the bits are written to SDRAM or flash is not relevant. It is if you bypass the cache though. I could certainly change my code to write the bytes in little-endian order if that would be helpful.
  • RossH Posts: 5,462
    edited 2011-02-23 12:43
    David Betz wrote: »
    It is if you bypass the cache though. I could certainly change my code to write the bytes in little-endian order if that would be helpful.

    Hi David,

    It would be helpful - I have already changed the version I use. Catalina does not always access the serial RAM or FLASH via the cache. During the initial program load, which is essentially just a long sequence of contiguous writes that you have no intention of re-reading, using the cache slows things down.

    Ross.
  • David Betz Posts: 14,516
    edited 2011-02-23 12:54
    Are your modifications backward compatible? If so, could you send me your modified code and I'll adopt it as the standard version?
  • RossH Posts: 5,462
    edited 2011-02-23 14:42
    Hi David,

    I made some trivial changes in the BREAD and BWRITE routines just so I could continue testing ...
    BREAD
            test    vmaddr, bit20 wz
      if_nz jmp     #FLASH_READ
    
            mov     fn, read
            call    #BSTART
    
    BREAD_DATA
    read0
            mov     count2,#4
            mov     endian,#0
            
    read1   call    #spiRecvByte
            or      endian,data
            ror     endian,#8
            djnz    count2,#read1
            
            wrlong  endian, ptr
            add     ptr, #4
            djnz    count, #read0
    
            call    #deselect
    BREAD_RET
            ret
    
    
    ... and ...
    BWRITE
            test    vmaddr, bit20 wz
  if_nz jmp     #BWRITE_RET
    
            mov     fn, write
            call    #BSTART
    
    pw      rdlong  endian, ptr
            mov     count2,#4
            
    pb      mov     data,endian
            call    #spiSendByte
            ror     endian,#8
            djnz    count2, #pb
            
            add     ptr, #4
            djnz    count, #pw
    
            call    #deselect
    BWRITE_RET
            ret
    
    fn      long    0
    
    endian  long    0
    count2  long    0
    
    
    However, I think these changes may cause your original version to exceed the 496 long limit (in my version I have disabled the SD Card stuff, so I don't have this problem).
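    For anyone following along, here is a little C model of what the BREAD_DATA loop does (assuming spiRecvByte leaves each byte in the low 8 bits of data, as the OR suggests):

    ```c
    #include <stdint.h>
    #include <assert.h>

    /* 32-bit rotate right, mirroring the PASM ror instruction
       (valid here for n in 1..31). */
    static uint32_t ror32(uint32_t v, unsigned n) {
        return (v >> n) | (v << (32 - n));
    }

    /* Model of the read loop: each byte from SPI is OR-ed into the low
       8 bits, then the long is rotated right by 8, four times over.
       The first byte received ends up in the least significant position,
       i.e. the long is assembled little-endian, matching Hub RAM layout. */
    static uint32_t assemble_long(const uint8_t bytes[4]) {
        uint32_t endian = 0;
        for (int i = 0; i < 4; i++) {
            endian |= bytes[i];       /* spiRecvByte result into the low byte */
            endian = ror32(endian, 8);
        }
        return endian;
    }
    ```

    Feeding it the byte sequence 0x78, 0x56, 0x34, 0x12 yields the long 0x12345678: the first byte off the wire lands in the least significant byte, which is the little-endian ordering the Prop expects.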

    Ross.
  • David Betz Posts: 14,516
    edited 2011-02-23 14:52
    Thanks! I'll try to merge it into my version. Have you released a version that uses both SPI SRAM and SPI flash on the C3?
  • RossH Posts: 5,462
    edited 2011-02-23 15:01
    Hi David,

    Not yet - I'm still struggling with the load process. I have to reorganize the various program segments when I store them in Flash, and then re-organize them just before I actually execute the program (e.g. to move the data segments from Flash into RAM). It's a pain in the proverbial!

    I hope I'll get time to finish this off over the coming weekend.

    Ross.
  • David Betz Posts: 14,516
    edited 2011-02-23 16:42
    Have you ever considered generating binaries that would work with the GNU linker? It handles all of that for me in ZOG. You can place .text and .data anywhere you want and you can also create an image of .data in the .text section and copy it into place at startup. That allows me to put the initial values of globals in flash and then move them to their place in SRAM just before calling main().
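    The scheme described here can be modelled in a few lines of C (the arrays below stand in for the linker-placed sections; real startup code would use linker-script symbols, whose names vary between toolchains):

    ```c
    #include <stdint.h>
    #include <string.h>
    #include <assert.h>

    /* Toy model: the linker keeps the initial values of .data in flash
       (flash_image) and startup code copies them to their run-time SRAM
       home (sram_data) before main() runs. In a real build the two
       addresses come from linker symbols, not arrays like these. */
    static const uint8_t flash_image[] = { 1, 2, 3, 4 };   /* .data image in flash */
    static uint8_t sram_data[sizeof flash_image];          /* .data at run time */

    static void crt0_copy_data(void) {
        memcpy(sram_data, flash_image, sizeof flash_image);
    }
    ```

    With GNU ld this split is usually arranged by giving .data a load address in flash (the AT() directive) distinct from its run-time address in RAM, which matches what is described: globals initialized from flash, then moved into SRAM just before calling main().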