Big Spin - is it still a pipedream?

jazzed · 2011-02-08 15:54

Dave Hein wrote: »

The only special case where it isn't masked off is the system I/O address of $1234000X, which is used for conio and fileio.

I noticed this today. One could easily move the HUB access "segment" to $2000_0000 or some other address if necessary.

The only external memory interface requirement for read/write RAM are read buffer and write buffer of N size starting at some address. I suppose a read-only device like flash should also be simulated.

Is it possible to use a similar scheme for simulated external memory? That is $12340000+some offset = base could be a read/write address, base+4 the resulting data pointer, and base+8 the data length. A second memory device could be added to flash. I use the LSB of the address for the read/write flag since I typically use a 32byte block of data for the interface.
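A minimal sketch of the LSB-flag idea in C, assuming blocks are aligned to 32 bytes so the low address bits are free to carry the direction. The names `xmem_encode`, `xmem_decode`, and `xmem_request` are hypothetical and are not part of SpinSim.

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE  32u   /* block size mentioned in the post */
#define WRITE_FLAG 1u    /* assumption: LSB set means "write" */

/* Hypothetical decoded request; not SpinSim's actual layout. */
typedef struct {
    uint32_t ext_addr;   /* block-aligned external address */
    bool     is_write;   /* true when the LSB flag was set */
} xmem_request;

/* Pack an aligned address and a direction into one mailbox long. */
static uint32_t xmem_encode(uint32_t ext_addr, bool is_write)
{
    return (ext_addr & ~(LINE_SIZE - 1u)) | (is_write ? WRITE_FLAG : 0u);
}

/* Split a mailbox long back into address and direction. */
static xmem_request xmem_decode(uint32_t mailbox)
{
    xmem_request r;
    r.ext_addr = mailbox & ~(LINE_SIZE - 1u);
    r.is_write = (mailbox & WRITE_FLAG) != 0u;
    return r;
}
```

With this packing a single long write carries both the block address and the read/write flag, so the data pointer and length at base+4 and base+8 only need to be set up once.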

David Betz · 2011-02-08 18:27

jazzed wrote: »

The problem with C3 or any serial RAM is speed. The first time I booted an external memory software solution on C3 I thought the thing didn't load ... astonishingly a minute later the TV screen finally turned blue and printed "Hello, world!".

Was that with ZOG? I don't see performance that bad on my C3.

Dr_Acula · 2011-02-08 19:46

jazzed

Getting a DracBlade port running should be a high priority. I don't have hardware, but others do.

With all the amazing work you are doing here, I am thinking about sending you a freebie board in return. Can you pm me your postal address?

Dave Hein · 2011-02-09 07:44

jazzed wrote: »

Is it possible to use a similar scheme for simulated external memory? That is $12340000+some offset = base could be a read/write address, base+4 the resulting data pointer, and base+8 the data length. A second memory device could be added to flash. I use the LSB of the address for the read/write flag since I typically use a 32byte block of data for the interface.

Steve,

The SpinSim system I/O addresses are currently defined as follows:

con
  SYS_COMMAND = $12340000 'word - conio and fileio
  SYS_LOCKNUM = $12340002 'word - conio and fileio
  SYS_PARM    = $12340004 'long - conio and fileio
  SYS_DEBUG   = $12340008 'long - turns debug prints on/off
  SYS_INVALID = $1234000c 'long - invalid hub memory address accesses are routed here

I could change the debug flag to be a word and add a "SYS_EXTMEM" word at $1234000a, which would simulate the external memory control lines and byte address/data mailbox. The Prop program would need to clock in a memory address and read/write the data one byte at a time. I could make the interface identical to the modified HX512 card, except it would use the SYS_EXTMEM mailbox. Does that sound OK, or would you prefer the interface you suggested? If so, could you describe that a bit more? I don't quite understand how it would work.

Dave

jazzed · 2011-02-09 09:07

Dave, I'm not sure the simulator needs to be cycle accurate. It could though for plug-ins etc....

Here's what I'm doing now:

#define SYS_CACHELNLEN 	  32
#define SYS_XRAM_SIZE  	  33
#define SYS_XRAM_ADDR  	  34
#define SYS_XRAM_DATA  	  35
#define SYS_FLSH_SIZE  	  36
#define SYS_FLSH_ADDR  	  37
#define SYS_FLSH_DATA  	  38

// external memory interface
//
#define XRAM_MAX (64*1024*1024)
static int cachelinelen = 0;
static uint32_t xmemdata = 0;
static uint32_t xramsize = 0;
static char *xrambuff = 0;
static uint32_t flshsize = 0;
static char *flshbuff = 0;

The idea is just to tell the simulator to malloc a memory block (once of course) at cache startup, then tell the simulator to write/read blocks of data that the SimCache.spin driver can use. I've added a flash block to potentially simulate the C3, but the interpreter will need a small re-write for address mappings.
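The allocate-once idea could look roughly like this on the simulator side. This is only a sketch: `xram_setup`, `xram_read_line`, and `xram_write_line` are hypothetical names, and fixing the line length at setup time is an assumption taken from the `SYS_CACHELNLEN` define.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define XRAM_MAX (64u * 1024u * 1024u)   /* upper bound from the post */

static uint8_t *xrambuff = NULL;   /* simulated external RAM, allocated once */
static uint32_t xramsize = 0;
static uint32_t cachelinelen = 0;  /* block length fixed at startup */

/* Hypothetical setup call: allocate the RAM once and record the line length.
   Returns 0 on success, -1 on bad arguments or allocation failure. */
static int xram_setup(uint32_t size, uint32_t linelen)
{
    if (xrambuff != NULL) return 0;              /* already set up: keep it */
    if (size == 0 || size > XRAM_MAX) return -1;
    if (linelen == 0 || size % linelen != 0) return -1;
    xrambuff = calloc(size, 1);
    if (xrambuff == NULL) return -1;
    xramsize = size;
    cachelinelen = linelen;
    return 0;
}

/* Copy one cache line from external RAM at addr into dst. */
static int xram_read_line(uint32_t addr, void *dst)
{
    if (xrambuff == NULL || addr > xramsize - cachelinelen) return -1;
    memcpy(dst, xrambuff + addr, cachelinelen);
    return 0;
}

/* Copy one cache line from src into external RAM at addr. */
static int xram_write_line(uint32_t addr, const void *src)
{
    if (xrambuff == NULL || addr > xramsize - cachelinelen) return -1;
    memcpy(xrambuff + addr, src, cachelinelen);
    return 0;
}
```

A flash device could reuse the same shape with a second buffer and no write routine.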

Dave Hein · 2011-02-09 09:52

The easiest way for me to implement this would be to just add support for SYS_EXMEMREAD and SYS_EXMEMWRITE functions to the conio/fileio commands. These two commands would act like the file read and write commands, except they would access an external memory buffer instead of a file. I could add a -x# command-line option to specify the size of the external memory. Does that sound OK?

Dave

jazzed · 2011-02-09 12:50

Dave Hein wrote: »

The easiest way for me to implement this would be to just add support for SYS_EXMEMREAD and SYS_EXMEMWRITE functions to the conio/fileio commands. These two commands would act like the file read and write commands, except they would access an external memory buffer instead of a file. I could add a -x# command-line option to specify the size of the external memory. Does that sound OK?

Dave

As long as the memory is addressable I guess it doesn't matter. The design I'm using reads and writes a buffer at a time of a given length. For C3 simulation, two memory spaces are required.

We can cognew multiple cogs right? I'm having trouble with that ... it could just be my bug though.

Dave Hein · 2011-02-09 13:12

jazzed wrote: »

As long as the memory is addressable I guess it doesn't matter. The design I'm using reads and writes a buffer at a time of a given length. For C3 simulation, two memory spaces are required.

We can cognew multiple cogs right? I'm having trouble with that ... it could just be my bug though.

The external memory methods would be something like ExtMemRead(HubAddr, ExtAddr, NumBytes) and ExtMemWrite(HubAddr, ExtAddr, NumBytes). The HubAddr would be between $0000 and $7FFF, and the ExtAddr would be between $00000000 and whatever the max address is for the external memory. The logical mapping of the external memory to some higher address space would be done by the caching routines. The caching routines would also need to manage the writebacks.

cognew works with all my tests. The program test.binary does several cognew's. If you are running without the "-p" option, a cognew of Spin code or a cognew at address $F004 will run the C version of the interpreter. A cognew of any other address, or any cognew with the "-p" option, will run PASM code instead.

Dave

jazzed · 2011-02-09 13:21

I'm hoping to recycle the cache code implemented in PASM.
Can I just wrlong val, addr where addr is $12340002 ?
I could do it in spin, but need another cog for that.

Dave Hein · 2011-02-09 14:19

$12340002 is actually the location for the system I/O lock number. The I/O command is at location $12340000 and a single long parameter is at $12340004. If the command requires more than one parameter $12340004 will then contain a pointer to an argument list. The Spin code and the PASM code are shown below. They compile, but I haven't tested them. I'll go ahead and add it to SpinSim, and I'll write a memory test program to check it out.

Dave

con
  SYS_COMMAND = $12340000
  SYS_LOCKNUM = $12340002
  SYS_PARM    = $12340004

  SYS_EXTMEM_READ   = 17
  SYS_EXTMEM_WRITE  = 18

pub ExtMemRead(HubAddr, ExtMemAddr, NumBytes)
  result := SystemCall(SYS_EXTMEM_READ, @HubAddr)

pub ExtMemWrite(HubAddr, ExtMemAddr, NumBytes)
  result := SystemCall(SYS_EXTMEM_WRITE, @HubAddr)

pri SystemCall(command, parm) | locknum
  locknum := word[SYS_LOCKNUM] - 1
  if locknum == -1
    return -1
    
  repeat until not lockset(locknum)
  long[SYS_PARM] := parm
  word[SYS_COMMAND] := command
  repeat while word[SYS_COMMAND]
  result := long[SYS_PARM]
  lockclr(locknum)

con
  SYS_COMMAND = $12340000
  SYS_LOCKNUM = $12340002
  SYS_PARM    = $12340004

  SYS_EXTMEM_READ   = 17
  SYS_EXTMEM_WRITE  = 18

dat
ExtMemRead              mov     ExtMemCommand, #SYS_EXTMEM_READ
                        jmp     #SystemCall

ExtMemWrite             mov     ExtMemCommand, #SYS_EXTMEM_WRITE

                        'Set up the parameter list in hub RAM
SystemCall              mov     temp, par
                        wrlong  HubAddr, temp
                        add     temp, #4
                        wrlong  ExtMemAddr, temp
                        add     temp, #4
                        wrlong  NumByte, temp

                        'Wait for lock not set
:loop1                  lockset locknum                       wc
        if_c            jmp     #:loop1
        
                        'Write the address of the parameter list
                        wrlong  par, SysParm
                        
                        'Write the external memory command
                        wrword  ExtMemCommand, SysCommand
                        
                        'Wait for the command to be completed
:loop2                  rdword  ExtMemCommand, SysCommand     wz
        if_nz           jmp     #:loop2
        
                        'Clear the lock
                        lockclr locknum
ExtMemRead_ret
ExtMemWrite_ret         ret                     

HubAddr                 long    0
ExtMemAddr              long    0
NumByte                 long    0
ExtMemCommand           long    0
locknum                 long    0
temp                    long    0
SysCommand              long    SYS_COMMAND
SysLocknum              long    SYS_LOCKNUM
SysParm                 long    SYS_PARM
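For reference, the simulator side of this handshake might be sketched in C as below. This is an assumption about how SpinSim could service the mailbox, not its actual source: `service_mailbox`, `hub_long`, the 64 KB external size, and returning the byte count in SYS_PARM are all hypothetical. The argument list is read as three little-endian longs, which matches the Propeller's byte order.

```c
#include <stdint.h>
#include <string.h>

#define HUB_SIZE 0x8000u     /* hub RAM: $0000..$7FFF */
#define EXT_SIZE 0x10000u    /* hypothetical external memory size */
#define SYS_EXTMEM_READ  17
#define SYS_EXTMEM_WRITE 18

static uint8_t hub[HUB_SIZE];
static uint8_t ext[EXT_SIZE];

/* Mailbox registers as the simulator might hold them (names assumed). */
static uint16_t sys_command;
static uint32_t sys_parm;

/* Read a little-endian long from hub RAM. */
static uint32_t hub_long(uint32_t addr)
{
    return (uint32_t)hub[addr] | (uint32_t)hub[addr + 1] << 8 |
           (uint32_t)hub[addr + 2] << 16 | (uint32_t)hub[addr + 3] << 24;
}

/* Service one pending external-memory command, if any.
   Other command numbers are left for their own handlers. */
static void service_mailbox(void)
{
    uint32_t args, hub_addr, ext_addr, n;
    int32_t result = -1;

    if (sys_command != SYS_EXTMEM_READ && sys_command != SYS_EXTMEM_WRITE)
        return;
    args = sys_parm;                       /* pointer to the 3-long list */
    if (args <= HUB_SIZE - 12u) {
        hub_addr = hub_long(args);
        ext_addr = hub_long(args + 4);
        n        = hub_long(args + 8);
        if (hub_addr <= HUB_SIZE && n <= HUB_SIZE - hub_addr &&
            ext_addr <= EXT_SIZE && n <= EXT_SIZE - ext_addr) {
            if (sys_command == SYS_EXTMEM_READ)
                memcpy(hub + hub_addr, ext + ext_addr, n);
            else
                memcpy(ext + ext_addr, hub + hub_addr, n);
            result = (int32_t)n;
        }
    }
    sys_parm = (uint32_t)result;   /* publish the result first ... */
    sys_command = 0;               /* ... then signal completion */
}
```

Writing the result before clearing the command matters because the Spin and PASM callers read SYS_PARM as soon as they see the command go to zero.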

jazzed · 2011-02-09 15:08

I was just editing my reply regarding that address

What you have looks OK for a start. I would like to see a way to set the memory size at startup if possible.

I would much rather "prearrange" the buffer length for performance, but if you have 10 external memories, you would need 10 lengths.

ExtMemSetMemorySize(BaseAddress, size)
ExtMemSetBufferSize(BaseAddress, size)
ExtMemWrite(addrp, datap, numbytes)
ExtMemRead(addrp, datap, numbytes)
ExtMemWriteBuffer(addrp, datap)
ExtMemReadBuffer(addrp, datap)

Thanks for adding whatever external memory access features you deem appropriate.
--Steve

jazzed · 2011-02-09 17:06

Dr_Acula wrote: »

jazzed

With all the amazing work you are doing here, I am thinking about sending you a freebie board in return. Can you pm me your postal address?

I may give you the address later. My free time and desk space are limited right now.

Dave Hein · 2011-02-09 18:42

I can add the functions you suggested, but let me make sure I understand them. Please look at the descriptions below to make sure they're correct.

ExtMemSetMemorySize(BaseAddress, size)
Allocate an external memory segment of "size" bytes starting at address given by "BaseAddress".

ExtMemSetBufferSize(BaseAddress, size)
Set the buffer size for the memory segment starting at "BaseAddress" to "size".

ExtMemWrite(addrp, datap, numbytes)
Copy a block of data of size "numbytes" from hub RAM starting at "addrp" into a memory segment containing the starting address "datap".

ExtMemRead(addrp, datap, numbytes)
Copy a block of data of size "numbytes" to hub RAM starting at "addrp" from a memory segment containing the starting address "datap".

ExtMemWriteBuffer(addrp, datap)
Copy a block of data from hub RAM starting at "addrp" into a memory segment containing the starting address "datap". The size of the block is determined by an earlier call to ExtMemSetBufferSize for that memory segment.

ExtMemReadBuffer(addrp, datap)
Copy a block of data to hub RAM starting at "addrp" from a memory segment containing the starting address "datap". The size of the block is determined by an earlier call to ExtMemSetBufferSize for that memory segment.

How many different memory segments do you think there should be? I don't understand the need for setting buffer sizes since the ExtMemRead and ExtMemWrite have an explicit buffer size parameter. I don't see how it makes it more efficient to pre-set the buffer size.
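The "memory segment containing the starting address" wording suggests a small segment table with an address lookup. A rough sketch in C, assuming at most four segments and using hypothetical names (`seg_alloc`, `seg_find`, `seg_read`, `seg_write`) with host pointers in place of hub addresses:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_SEGMENTS 4          /* assumed limit for this sketch */

typedef struct {
    uint32_t base;              /* first external address of the segment */
    uint32_t size;              /* length in bytes */
    uint8_t *data;              /* host memory backing the segment */
} segment;

static segment segments[MAX_SEGMENTS];
static int num_segments = 0;

/* Allocate a segment of size bytes at base. Overlaps are not checked here. */
static int seg_alloc(uint32_t base, uint32_t size)
{
    segment *s;
    if (num_segments == MAX_SEGMENTS || size == 0) return -1;
    if (size - 1u > UINT32_MAX - base) return -1;  /* would wrap past 4 GB */
    s = &segments[num_segments];
    s->data = calloc(size, 1);
    if (s->data == NULL) return -1;
    s->base = base;
    s->size = size;
    num_segments++;
    return 0;
}

/* Find the segment that holds [addr, addr + len), or NULL. */
static segment *seg_find(uint32_t addr, uint32_t len)
{
    for (int i = 0; i < num_segments; i++) {
        segment *s = &segments[i];
        if (addr >= s->base && addr - s->base < s->size &&
            len <= s->size - (addr - s->base))
            return s;
    }
    return NULL;
}

/* Copy len bytes from src into the segment containing addr. */
static int seg_write(uint32_t addr, const void *src, uint32_t len)
{
    segment *s = seg_find(addr, len);
    if (s == NULL) return -1;
    memcpy(s->data + (addr - s->base), src, len);
    return 0;
}

/* Copy len bytes from the segment containing addr into dst. */
static int seg_read(uint32_t addr, void *dst, uint32_t len)
{
    segment *s = seg_find(addr, len);
    if (s == NULL) return -1;
    memcpy(dst, s->data + (addr - s->base), len);
    return 0;
}
```

A per-segment buffer size, if wanted, would just be one more field in the struct that the "Buffer" variants read instead of taking a length argument.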

jazzed · 2011-02-09 21:42

That interpretation is reasonable. I guess addrp was supposed to be the physical backstore address (not a pointer) for a cache and datap would be the hub start address to read or write. It doesn't matter much as long as the parameter definitions are clear for the next guy/gal that comes along.

Dave Hein wrote: »

How many different memory segments do you think there should be? I don't understand the need for setting buffer sizes since the ExtMemRead and ExtMemWrite have an explicit buffer size parameter. I don't see how it makes it more efficient to pre-set the buffer size.

Two segments might get it, but I've seen separate packet memory before. Also, it's possible that someone might use part of SDRAM for a giant disk buffer.

Presetting a buffer size for a cache, for example, eliminates the need to pass it more than once, which gives a performance advantage in a PASM cache swap driver.

Dave Hein · 2011-02-10 15:49

Steve,

I posted an update to SpinSim that supports up to four external memories. I added three system functions to allocate memory and read or write it to hub RAM. I wasn't clear on how your "Buffer" functions would work, so I didn't implement them. I think that functionality can be done in the caching code. Take a look at memtest.spin and extmem.spin. They demonstrate how external memory is accessed. I also included extmempasm.spin, which has the PASM routines to read and write external memory.

Dave

jazzed · 2011-02-10 16:18

Attached is a C3 version of the LittleBigSpin .zip I posted before. It still has the memory push-up hack, and requires BST/BSTC or Homespun to compile this time. Really I just want to post the working code before it gets lost some way. Also, since David uses the same interface for DracBlade as he does for C3, theoretically all you should need to do for DracBlade is change the #define at the top of the LittleBigSpin file and make a compatible hello.bin program (make a new userdefs.spin and change the u#TvPin). If it doesn't work, just look for changes around "#ifdef C3" and adjust. I've included copies of dracblade files from David's latest Zog.

Please use the hello_c3 .zip to create a hello_c3.bin and load it on your SDcard. That is the program that the LittleBigSpin interpreter will load/run.

@Dave. Thanks for adding all that code. I'll write a compliant PASM driver that uses your block copy methods.

jazzed · 2011-02-10 23:05

@Dave,

I'm having some trouble with my SPIN writeByte routine not detecting a command done = 0 handshake signal from PASM. I can see in your listing where 0 gets written to the command register, but writeByte waits forever. The problem only occurs after a cache line is swapped on a write-back to simulator external memory on the 33rd byte. On a "cache hit" the done = 0 handshake works with no problem. The code is attached if you want to have a look. The problem happens when trying to writeback the first buffer (see the SimCache.spin::flush routine) well before the SPIN interpreter gets replaced.

$ ./spinsim.exe LittleBigSpin.binary

Starting ... 448
Writing Cache ...
 0000 00 B4 C4 04 6F 24 10 00 C0 01 C8 01 1C 00 CC 01
 0010 30 00 02 01 0C 00 00 00 30 00 00 00 01 37 24 38

Please have a look.
The same algorithm works with real code of course.
Am I abusing something in the simulator?

Thanks.

BTW: Have you ever considered attaching a virtual serial port to your simulator?

Dave Hein · 2011-02-11 10:49

Steve,

I found a problem in the PASM code I wrote. It uses the PAR register to store the three parameters in memory, and I use a register called temp, which you also used for some code. Maybe that code at temp is only called once, so it's OK to overwrite it, but I changed my temp register to temp0. I also added a par0 value that points to its image in hub RAM. It should be OK to re-use this area of hub RAM unless you reload the cog from it or use it for something else. I have attached a version of SimCache.spin with my changes. I've also included spinsim.c with some debug prints added. It looks like your code is never doing a writeback. It only reads from the external memory.

Dave

Edit: I also added a putch function to SimCache.spin to print out some debug characters. I print an "R" when we read external memory and a "W" when it writes. The "R" gets printed, but not the "W".

Dave Hein · 2011-02-11 13:03

OK, I understand why there weren't any cache writebacks. It's because the cache starts out empty and a writeback isn't needed until we need to fill the cache line with another chunk of memory.
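A toy single-line write-back cache in C shows the same behaviour. This is a sketch, not the SimCache.spin logic: it assumes one 32-byte line, and the names and counters are made up for the illustration. Callers must keep addresses below `EXT_SIZE`.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define LINE_SIZE 32u
#define EXT_SIZE  1024u

static uint8_t ext[EXT_SIZE];       /* stand-in for external memory */
static uint8_t line[LINE_SIZE];     /* the single cache line */
static uint32_t line_tag;           /* external address of the cached line */
static bool line_valid, line_dirty;
static unsigned fills, writebacks;  /* counters for the demonstration */

/* Make sure the line holding addr is cached; return a pointer to the byte. */
static uint8_t *cache_lookup(uint32_t addr)
{
    uint32_t tag = addr & ~(LINE_SIZE - 1u);
    if (!line_valid || line_tag != tag) {
        if (line_valid && line_dirty) {        /* evict: write back first */
            memcpy(ext + line_tag, line, LINE_SIZE);
            writebacks++;
        }
        memcpy(line, ext + tag, LINE_SIZE);    /* fill from external memory */
        fills++;
        line_tag = tag;
        line_valid = true;
        line_dirty = false;
    }
    return &line[addr & (LINE_SIZE - 1u)];
}

static void cache_write_byte(uint32_t addr, uint8_t value)
{
    *cache_lookup(addr) = value;
    line_dirty = true;                         /* line now differs from ext */
}

static uint8_t cache_read_byte(uint32_t addr)
{
    return *cache_lookup(addr);
}
```

Writing bytes 0 through 31 causes one fill and no write-back. The first write-back happens on the 33rd byte, when address 32 forces the dirty line out.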

Another problem I found is that SpinSim's pread never returns a value less than zero. The test "if result < 0" needs to be changed to "if result =< 0". I'll fix that in the next update. With that change it gets to the point where it prints the "Startup addresses".

jazzed wrote: »

BTW: Have you ever considered attaching a virtual serial port to your simulator?

Steve, what do you mean by a virtual serial port? Are you talking about a memory mapped register where you could read and write characters?

jazzed · 2011-02-11 16:21

Great David. Thanks. I'll give < 1 a try for terminating the pread loop later today. I've been out of the office.

I've noticed that many of our file objects have different expectations and I guess that return code is a subtle difference. It makes more sense to me to check for <= 0 for end of file rather than < 0, and < 0 should probably mean some kind of an error.
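That convention can be sketched in C as a read loop that treats 0 as end of file and a negative value as an error. The `read_fn` signature, `read_all`, and the in-memory `mem_read` source are hypothetical stand-ins, not SpinSim's pread.

```c
#include <string.h>

/* Hypothetical reader: returns bytes read, 0 at end of file, <0 on error. */
typedef int (*read_fn)(void *ctx, unsigned char *buf, int len);

/* Read until end of file or error. Returns total bytes, or the error code. */
static long read_all(read_fn rd, void *ctx, unsigned char *dst, long cap)
{
    long total = 0;
    while (total < cap) {
        long room = cap - total;
        int want = room > 512 ? 512 : (int)room;
        int n = rd(ctx, dst + total, want);
        if (n < 0) return n;   /* error */
        if (n == 0) break;     /* end of file: stop instead of spinning */
        total += n;
    }
    return total;
}

/* In-memory source used to exercise the loop. */
typedef struct { const unsigned char *data; int len, pos; } mem_src;

static int mem_read(void *ctx, unsigned char *buf, int len)
{
    mem_src *s = ctx;
    int left = s->len - s->pos;
    int n = len < left ? len : left;
    if (n <= 0) return 0;      /* nothing left: report end of file */
    memcpy(buf, s->data + s->pos, (size_t)n);
    s->pos += n;
    return n;
}
```

A loop that only tests for a negative result never exits against a reader like `mem_read`, which is the hang described above.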

I was thinking a virtual serial port like being able to download to the simulation and talk to it with a serial terminal. It's just an idea that would be a lot like using a real board from one of the GUIs. Guess it was just a crazy thought ... I don't expect you to take it seriously.

jazzed · 2011-02-13 13:38

The serial port is working now with BigSpin ... forgot to stop the loader cog before starting the new interpreter before. I'm not sure what it will take to make the simulation fully functional with the bigspin interpreter.

I guess the next step is resolving some code issues.

I'd like to see a windows GUI application to automate the build/download process.
Maybe the PZST Qt IDE can have a mode to deal with bigspin later.

I thought a fibo comparison would be interesting. Here are some FIBO* results at 80MHz:

Hardware  |  Language  |  FIBO(20) time  |  FIBO 0 to 26
----------+------------+-----------------+--------------
C3        |  SPIN      |  547ms          |  30s
SDRAM     |  SPIN      |  547ms          |  30s
C3        |  BigSPIN   |  3601ms         |  2m53s
SDRAM     |  BigSPIN   |  2858ms         |  2m19s
C3        |  ZOG C     |  3644ms         |  3m18s
SDRAM     |  ZOG C     |  2773ms         |  2m18s
----------+------------+-----------------+--------------

The numbers are interesting because ZOG and BigSpin use similar code, data, and stack storage and access methods where everything is cached on SDRAM. By this "weak" measure, BigSpin is 5 times slower than Spin and fractionally slower than ZOG. There is still room for performance improvement in the BigSpin interpreter. Language implementations that keep data and stack in HUB RAM will be faster.

*The fibo test is not very good for benchmarking, but it is a fairly simple relative test for small code loops. Large programs will have very different results.
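The slowdown relative to plain Spin can be worked out from the FIBO(20) column. A small C sketch of that arithmetic, using only the numbers in the table (the `row` struct and function names are made up for the example):

```c
#include <stdio.h>

#define SPIN_BASELINE_MS 547.0   /* Spin FIBO(20) time from the table */

typedef struct {
    const char *hw;
    const char *lang;
    double fibo20_ms;
} row;

static const row rows[] = {
    { "C3",    "BigSPIN", 3601.0 },
    { "SDRAM", "BigSPIN", 2858.0 },
    { "C3",    "ZOG C",   3644.0 },
    { "SDRAM", "ZOG C",   2773.0 },
};

/* Ratio of a FIBO(20) time to the Spin baseline. */
static double slowdown(double ms)
{
    return ms / SPIN_BASELINE_MS;
}

/* Print one line per table row with its slowdown factor. */
static void print_slowdowns(void)
{
    for (size_t i = 0; i < sizeof rows / sizeof rows[0]; i++)
        printf("%-6s %-8s %.1fx\n", rows[i].hw, rows[i].lang,
               slowdown(rows[i].fibo20_ms));
}
```

This gives about 6.6x (C3) and 5.2x (SDRAM) for BigSpin, and about 6.7x (C3) and 5.1x (SDRAM) for ZOG, so the "5 times slower" figure fits the SDRAM rows better than the C3 rows.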

RossH · 2011-02-13 14:12

Hi jazzed,

Good stuff! I'll have to get a move on with Catalina's support for Flash RAM on the C3. Can you post the source you use for fibo for future reference?

Thanks!

Ross.

jazzed · 2011-02-13 16:40

RossH wrote: »

Can you post the source you use for fibo for future reference?

We all agree a fibo test is not a good benchmark, but it can be used to judge primitive relative value.

See attachments.

The spin code will run on either SPIN or BIGSPIN interpreters.
The zog code will run on C3, DracBlade, or PropellerPlatform SDRAM.

If you want to run the zog example, you will need the toolchain from here:
http://opensource.zylin.com/zpudownload.html

LittleBigSpin files are included which will load/run fibo.bin from SDCARD.
The DracBlade package is unknown but may work. Can someone please test it?

Dave Hein · 2011-02-13 19:14

Steve, I got the hello.bin program to work with the Big Spin interpreter under SpinSim. There's a problem when using the SMALLER cache mode in the interpreter. It uses the cache line address immediately after the cache cog clears the command. However, the cache cog doesn't write the cache line address until three instructions later. The cache cog assumes that the calling cog will already have the address if the correct line is in the cache, so it clears the command early.

I moved the instruction that clears the command after the cache line address is written. This fixes the problem. I put XXXXXXXXXXXXXXXXXXXXXX lines around my changes. Of course the correct solution is to leave the cog cache code the way it is and fix the problem in the interpreter. I didn't try it with SMALLER undefined, so that version of the code may be correct.
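The ordering problem can be modelled deterministically in C. This is only a sketch under a simplifying assumption: the client is able to observe the mailbox between any two server steps and reads the line address the moment it sees the command cleared. The `mailbox` struct and `run_request` are hypothetical names.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t command;    /* nonzero while a request is pending */
    uint32_t line_addr;  /* cache line address returned to the caller */
} mailbox;

/* Run one request and return the line address the client picks up.
   clear_early selects which of the two orderings is modelled. */
static uint32_t run_request(bool clear_early, uint32_t stale, uint32_t fresh)
{
    mailbox mb = { 1u, stale };   /* request pending, old result in place */

    for (int step = 0; step < 4; step++) {
        if (clear_early) {
            if (step == 0) mb.command = 0;        /* done flag first ... */
            if (step == 3) mb.line_addr = fresh;  /* ... result 3 steps later */
        } else {
            if (step == 2) mb.line_addr = fresh;  /* result first ... */
            if (step == 3) mb.command = 0;        /* ... done flag last */
        }
        /* The client polls between server steps and reads at once. */
        if (mb.command == 0)
            return mb.line_addr;
    }
    return mb.line_addr;
}
```

`run_request(true, old, new)` hands back the old address, while `run_request(false, old, new)` hands back the new one, which is the effect of moving the clear after the address write.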

I was a bit surprised to see the "Hello World" message print on the screen. I assumed the interpreter would try to map the $1234000x addresses to the cache. Do you make a special case for that address space, or are there certain addressing modes that you don't map to the cache, such as the absolute addressing mode that uses an absolute address from the stack?

The cog didn't terminate correctly. It ran to the end of memory and a debug print in SpinSim shows that it was addressing beyond the external memory. The interpreter normally terminates a program by inserting an $FFF9 return address on the stack. Of course in this case that would require having a cogstop(cogid) spin instruction at $FFF9 in external memory.

I attached the SimCache.spin file that I used. I also defined a 3-long area at the beginning with the label SysParms that is used for the three system function call parameters.

Dave

Dave Hein · 2011-02-13 20:05

I tried commenting out the "#define SMALLER" and the image was too large, so I'm guessing this mode doesn't work yet. I looked at your other versions for C3 and SDRAM and it seems like they would have the same problem as SpinSim. The cache line address will have an error every time you cross a cache line boundary. Maybe I'm missing something, but that's how it looks to me.

Dave

jazzed · 2011-02-13 20:34

Hey Dave.

Thanks for sleuthing that problem in SimCache.spin. I'm aware that what you found was possible, but I've never had a problem with it on real hardware, and just took it for granted. I believe the reason it works as well as it does on the chip is the cog sequence. It is a nice optimization, so it would be hard for me to change it.

It's great to have another functional platform. Having the simulator working is especially sweet since no hardware is required for general software development. Maybe someday a nicely thought out set of virtual devices could be added to the simulator via a generic interface - gear supports some virtual devices but how that happens is mysterious and as such a dead end.

Yes, #define SMALLER is the only way the cache works right now. At some point the other one can be done I hope because it is FASTER - up to 15% faster in some measurements with ZOG.

The reason you saw "Hello World" is because any address >= $10000000 is interpreted as a HUB address in the interpreter. That's the shared memory mechanism.
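In C terms the rule amounts to a single threshold test. A sketch, assuming the $10000000 boundary quoted above and with hypothetical names (`route`, `route_address`):

```c
#include <stdint.h>

#define HUB_WINDOW_BASE 0x10000000u   /* threshold quoted in the post */

typedef enum { ROUTE_CACHE, ROUTE_HUB } route;

/* Decide whether an interpreter address goes to the external-memory
   cache or straight to hub memory. */
static route route_address(uint32_t addr)
{
    return addr >= HUB_WINDOW_BASE ? ROUTE_HUB : ROUTE_CACHE;
}
```

Since $12340000 is above the threshold, the SpinSim system I/O mailbox bypasses the cache and reaches the simulator directly.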

Now, if we could only get a DracBlade volunteer

Thanks.
--Steve

Heater. · 2011-02-13 23:51

BigSpin and Zog perform a fibo(20) in about 3 seconds. RossH has posted a result for Catalina's fibo(20) on a DracBlade as a tad less than 1 second.

Should we all pack up and go home:)

Mind you I believe that Catalina result is with stack/data in HUB RAM.

RossH · 2011-02-14 01:22

Heater. wrote: »

BigSpin and Zog perform a fibo(20) in about 3 seconds. RossH has posted a result for Catalina's fibo(20) on a DracBlade as a tad less than 1 second.

Should we all pack up and go home:)

Mind you I believe that Catalina result is with stack/data in HUB RAM.

Catalina's stack is always in Hub RAM. The data is in Hub RAM when using either the LMM memory model or the XMM SMALL memory model. The data is in XMM RAM when using the XMM LARGE model.

I'm not really sure what Jazzed's benchmark figures mean. I presume "C3" means executing from SPI RAM on the C3? If so, those times for Zog and BigSpin look pretty good. But SDRAM means ... what?

Anyway, I just did a quick test with Catalina. On the C3 executing from Hub RAM, FIBO 20 is 306ms, and FIBO 0 to FIBO 26 is about 11 seconds. Faster than Spin, but not lightning fast. This is due to the recursive nature of the FIBO benchmark - it doesn't really matter what language you're using - stack manipulation still takes about the same amount of time.

But I wouldn't pack up and go home just yet if I were you ... when I tried Catalina on the C3 executing from SPI RAM, I get FIBO 20 of 7386ms, and FIBO 0 to FIBO 26 of 5m50s! Groan!

I presume having a caching SPI driver makes the big difference here - it seems to double the execution speed! I'll need to sharpen my pencils and get back to work on my own caching Catalina SPI driver.

Ross.

Heater. · 2011-02-14 01:59

RossH,

Seems to me that both BigSpin and Zog need an option to get stack into HUB at least. Given that most microcontroller code is not expected to be massively recursive or have huge local variables that would be fine.

Data in HUB is more problematic if you want "big" programs.

But SDRAM means ... what?

Well, there is the Gadget Gangster 32MB SDRAM board with SDRAM cache interface by Jazzed. As far as I can tell that result is obtained with all code/data/stack out there in the 32MB.

...a quick test with Catalina. Faster than Spin, but not lightning fast. This is due to the recursive nature of the FIBO benchmark

Yep, this fibo thing really needs to be shot in the head. It is totally unrepresentative of what software in a typical mcu-application does. It is really only a good test of subroutine calling efficiency.
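To put a number on that, here is the usual naive recursion in C with a call counter. The attached benchmark source is not reproduced in the thread, so this form (fibo(0) = 0, fibo(1) = 1) is an assumption about what was measured.

```c
#include <stdint.h>

static uint64_t calls;   /* number of times fibo() has been entered */

/* Naive doubly recursive Fibonacci, the shape typically used for this test. */
static uint32_t fibo(uint32_t n)
{
    calls++;
    if (n < 2)
        return n;
    return fibo(n - 1) + fibo(n - 2);
}
```

Computing fibo(20) this way returns 6765 after 21891 calls, each doing one compare and at most one add, so the time is dominated by call and stack overhead.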

We need something with some more normal loops and conditionals and a selection of operators in use.

It did cross my mind that as I now have implementations of my FFT in C, Spin and PASM that this would make a better benchmark. It is a substantial piece of code with a good selection of loops and operators that takes a nice time to execute. All three versions are written to be as similar as possible in approach and so are more or less directly equivalent line for line. Well not so much the PASM of course, especially since Lonesock optimized it, but that need not concern us. The FFT is a bit short of "if" statements though.

...when I tried Catalina on the C3 executing from SPI RAM, I get FIBO 20 of 8237ms,

Oh goodie, the race is still on:)

Dave Hein · 2011-02-14 05:32

jazzed wrote: »

Thanks for sleuthing that problem in SimCache.spin. I'm aware that what you found was possible, but I've never had a problem with it on real hardware, and just took it for granted. I believe the reason it works as well as it does on the chip is the cog sequence. It is a nice optimization, so it would be hard for me to change it.

I see how the hub access stalls will prevent the problem on the real hardware. I guess I need to add a cycle-accurate mode in the simulator. I currently don't simulate instruction pipe-lining or the hub access time slots.
