Zog - A ZPU processor core for the Prop + GNU C, C++ and FORTRAN. Now replaces S - Page 25 — Parallax Forums


Comments

  • Heater. Posts: 21,230
    edited 2010-11-16 06:47
    Given that the code from _start should look like this:
    _start():
    ../../../../gcc/libgloss/zpu/crt_io.c:91
       0:   0b              nop
       1:   0b              nop
       2:   0b              nop
       3:   0b              nop
       4:   82              im 2
       5:   70              loadsp 0
       6:   0b              nop
       7:   0b              nop
       8:   0b              nop
       9:   94              im 20
       a:   d4              im -44
       b:   0c              store
       c:   3a              config
       d:   0b              nop
       e:   0b              nop
       f:   0b              nop
      10:   92              im 18
      11:   81              im 1
      12:   04              poppc
    
    
    It looks like an endian issue. At least the instructions at
    100024 to 100027 are in reverse. After that it looks a bit odd.
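    The reversal is easy to check in C; a minimal sketch of the 32-bit byte swap a loader would apply per word (illustrative only, not the actual Zog code):

```c
#include <stdint.h>

/* Reverse the byte order of one 32-bit word. ZPU images are
   big-endian, so a loader that reads the binary as little-endian
   32-bit words sees each group of four instruction bytes reversed,
   as in the trace above. Illustrative helper, not Zog source. */
static uint32_t swap32(uint32_t w)
{
    return ((w & 0x000000FFu) << 24) |
           ((w & 0x0000FF00u) <<  8) |
           ((w & 0x00FF0000u) >>  8) |
           ((w & 0xFF000000u) >> 24);
}
```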
  • David Betz Posts: 14,511
    edited 2010-11-16 06:48
    I got rid of the byte swapping of the binary file and I get a bit further. Unfortunately, it still ends up in the weeds eventually when it tries to branch to zero. I'm not sure why that is happening. Here is the single step trace.
    Starting SD driver...0000FFFF
    Mounting SD...00000000
    start:      00100020
    data_image: 00103970
    data_start: 00000010
    data_end:   00000810
    bss_start:  00000810
    bss_end:    00000824
    
    pc       op sp       tos      reason
    00100020 0B 0000FFF8 80FF0204
    00100021 0B 0000FFF8 80FF0204
    00100022 0B 0000FFF8 80FF0204
    00100023 0B 0000FFF8 80FF0204
    00100024 82 0000FFF8 80FF0204
    00100025 70 0000FFF4 00000002
    00100026 0B 0000FFF0 00000002
    00100027 0B 0000FFF0 00000002
    00100028 0B 0000FFF0 00000002
    00100029 0B 0000FFF0 00000002
    0010002A A4 0000FFF0 00000002
    0010002B 0C 0000FFEC 00000024
    0010002C 3A 0000FFF4 00000002
    0010002D 0B 0000FFF8 80FF0204
    0010002E 80 0000FFF8 80FF0204
    0010002F C0 0000FFF4 00000000
    00100030 E8 0000FFF4 00000040
    00100031 DC 0000FFF4 00002068
    00100032 04 0000FFF4 0010345C
    0010345C FD 0000FFF8 80FF0204
    0010345D 3D 0000FFF4 FFFFFFFD
    0010345E 0D 0000FFF4 0000FFE8
    0010345F 80 0000FFE8 00000001
    00103460 0B 0000FFE4 00000000
    00103461 A4 0000FFE4 00000000
    00103462 08 0000FFE0 00000024
    00103463 54 0000FFE0 00000002
    00103464 54 0000FFE4 00000000
    00103465 72 0000FFE8 00000001
    00103466 81 0000FFE4 00000002
    00103467 2E 0000FFE0 00000001
    00103468 9C 0000FFE4 00000000
    00103469 38 0000FFE0 0000001C
    0010346A 73 0000FFE8 00000001
    0010346B 90 0000FFE4 00000000
    0010346C A0 0000FFE0 00000010
    0010346D 0C 0000FFE0 00000820
    0010346E FE 0000FFE8 00000001
    0010346F DE 0000FFE4 FFFFFFFE
    00103470 3F 0000FFE4 FFFFFF5E
    001033CE A0 0000FFE4 00103471
    001033CF 08 0000FFE0 00000020
    001033D0 80 0000FFE0 FFFFFFFF
    001033D1 2E 0000FFDC 00000000
    001033D2 A0 0000FFE0 00000000
    001033D3 38 0000FFDC 00000020
    001033D4 A4 0000FFE4 00103471
    001033D5 08 0000FFE0 00000024
    001033D6 82 0000FFE0 00000002
    001033D7 2E 0000FFDC 00000002
    001033D8 B9 0000FFE0 00000001
    001033D9 38 0000FFDC 00000039
    00103412 80 0000FFE4 00103471
    00103413 C0 0000FFE0 00000000
    00103414 A8 0000FFE0 00000040
    00103415 80 0000FFE0 00002028
    00103416 8C 0000FFE0 00101400
    00103417 0B 0000FFE0 080A000C
    00103418 0B 0000FFE0 080A000C
    00103419 0B 0000FFE0 080A000C
    0010341A 0B 0000FFE0 080A000C
    0010341B 90 0000FFE0 080A000C
    0010341C 90 0000FFDC 00000010
    0010341D 0C 0000FFDC 00000810
    0010341E 80 0000FFE4 00103471
    0010341F C0 0000FFE0 00000000
    00103420 A8 0000FFE0 00000040
    00103421 80 0000FFE0 00002028
    00103422 94 0000FFE0 00101400
    00103423 0B 0000FFE0 080A0014
    00103424 90 0000FFE0 080A0014
    00103425 94 0000FFDC 00000010
    00103426 0C 0000FFDC 00000814
    00103427 80 0000FFE4 00103471
    00103428 C0 0000FFE0 00000000
    00103429 F1 0000FFE0 00000040
    0010342A C0 0000FFE0 00002071
    0010342B 0B 0000FFE0 001038C0
    0010342C 90 0000FFE0 001038C0
    0010342D 98 0000FFDC 00000010
    0010342E 0C 0000FFDC 00000818
    0010342F 04 0000FFE4 00103471
    00103471 FF 0000FFE8 00000001
    00103472 BF 0000FFE4 FFFFFFFF
    00103473 97 0000FFE4 FFFFFFBF
    00103474 8B 0000FFE4 FFFFDF97
    00103475 3F 0000FFE4 FFEFCB8B
    00000000 80 0000FFE4 00103476
    00000001 FF 0000FFE0 00000000
    00000002 02 0000FFE0 0000007F
    
  • David Betz Posts: 14,511
    edited 2010-11-16 09:22
    David Betz wrote: »
    I got rid of the byte swapping of the binary file and I get a bit further. Unfortunately, it still ends up in the weeds eventually when it tries to branch to zero. I'm not sure why that is happening. Here is the single step trace.

    I just figured out a bit of additional information. The code seems to get lost when executing a 'callpcrel' instruction. Is there any reason why I might have trouble with that instruction?
  • David Betz Posts: 14,511
    edited 2010-11-16 09:28
    David Betz wrote: »
    I just figured out a bit of additional information. The code seems to get lost when executing a 'callpcrel' instruction. Is there any reason why I might have trouble with that instruction?
    Never mind this question. The problem seems to be that the callpcrel instruction is trying to call _init which the linker has placed at location zero for some reason. I'll have to track down why that is happening. Any idea where the source to _init is?
  • Heater. Posts: 21,230
    edited 2010-11-16 10:36
    I can't for the life of me find out where _init comes from.

    I was already trying to find it a week ago when I was trying to link with customized newlib.

    Anyway, looking at my fibo listing I see _init sitting in a ".init" section all of its own, which is not catered for in your linker script. Perhaps it just needs adding to the ROM area.
  • David Betz Posts: 14,511
    edited 2010-11-16 10:44
    Thanks for the suggestion. I tried adding the .init input section to the .text output section and that created some bizarre error messages about .data and .bss not being in RAM. Looks like I have a bit more work to do...
  • Heater. Posts: 21,230
    edited 2010-11-16 12:14
    How about overriding the _premain function in your fibo test and just dropping the call to _init? It can also set _use_syscall and the like.

    Like so:
    int _premain()
    {
            int t;
            _use_syscall=1;
            _initIO();
            t=main(1, 0);
            exit(t);
            for (;;);
    }
    
    That's assuming _init does not actually do anything useful for us.
  • David Betz Posts: 14,511
    edited 2010-11-16 12:50
    Heater. wrote: »
    I can't for the life of me find out where _init comes from.
    I'm not sure if _init comes from the same place but all of the .init and .fini sections come from a file called toolchain/gcc/gcc/crtstuff.c that gets used to build crtbegin.o and crtend.o.
  • David Betz Posts: 14,511
    edited 2010-11-16 18:57
    I finally got code running from the SPI flash on the C3! This was after lots of mucking around with the linker script. Here are the fibo results. Speed isn't very impressive but I can now run up to 1MB of code with 64K of data on a C3! Even though the SD card gets mounted at the start of this run, it isn't actually needed. The code can run from previously programmed flash. Another nice thing about this is that I now have enough code space for an interactive BASIC bytecode compiler and VM, and probably also an editor if I can get one written.
    Starting SD driver...0000FFFF
    Mounting SD...00000000
    Opening ZPU image 'fibo.zbn'...00000000
    Reading image...16904 bytes
    start:      00100020
    data_image: 001039F0
    data_start: 00000000
    data_end:   00000818
    bss_start:  00000818
    bss_end:    00000830
    fibo(00) = 000000 (00000ms)
    fibo(01) = 000001 (00000ms)
    fibo(02) = 000001 (00000ms)
    fibo(03) = 000002 (00000ms)
    fibo(04) = 000003 (00001ms)
    fibo(05) = 000005 (00002ms)
    fibo(06) = 000008 (00003ms)
    fibo(07) = 000013 (00005ms)
    fibo(08) = 000021 (00009ms)
    fibo(09) = 000034 (00015ms)
    fibo(10) = 000055 (00024ms)
    fibo(11) = 000089 (00039ms)
    fibo(12) = 000144 (00064ms)
    fibo(13) = 000233 (00104ms)
    fibo(14) = 000377 (00168ms)
    fibo(15) = 000610 (00273ms)
    fibo(16) = 000987 (00443ms)
    fibo(17) = 001597 (00721ms)
    fibo(18) = 002584 (01170ms)
    fibo(19) = 004181 (01891ms)
    fibo(20) = 006765 (03050ms)
    fibo(21) = 010946 (04917ms)
    fibo(22) = 017711 (07931ms)
    fibo(23) = 028657 (12808ms)
    fibo(24) = 046368 (20709ms)
    fibo(25) = 075025 (33535ms)
    fibo(26) = 121393 (13427ms)
    
    pc       op sp       tos      reason
    0010351B 00 0000FFB8 001033F3 BREAKPOINT
    
  • jazzed Posts: 11,803
    edited 2010-11-16 20:13
    David Betz wrote: »
    I finally got code running from the SPI flash on the C3! This was after lots of mucking around with the linker script. Here are the fibo results. Speed isn't very impressive but I can now run up to 1MB of code with 64K of data on a C3!
    Congratulations David!

    You just have to tell me one way or another how many seconds it takes to run all of fibo from the text "Opening ZPU" to the end at "BREAKPOINT." Please.

    I've been working on a TV demo on and off. Hopefully, we can merge efforts. I really want to have a reasonable GUI running soon. I have a few days between boards to get serious with that again.
  • David Betz Posts: 14,511
    edited 2010-11-16 20:19
    As it turns out, I'm still having some trouble with the _init function that is called as part of the runtime library initialization. I commented it out to get the fibo run that I posted earlier. When I put it back into the build, I end up crashing at location zero. It seems that something gets compiled into the _init function that ends up trying to call a function at zero. I'm still trying to figure out how code gets added to _init.

    UPDATE: I figured out why I was having a problem with _init. I had forgotten to place the .ctors and .dtors sections in the initialized RAM. All is working well now.
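    For anyone hitting the same wall, a generic GNU ld fragment showing this kind of placement; section and region names follow the memory map quoted elsewhere in the thread, and the real Zog script may differ:

```ld
/* Illustrative fragment only, not the actual Zog linker script.
   .init/.fini are code and live with .text in rom; .ctors/.dtors
   are constructor/destructor tables walked at startup, so they
   must be part of the initialized RAM image. */
SECTIONS
{
  .text : {
    *(.init)
    *(.text*)
    *(.fini)
  } > rom

  .data : {
    *(.ctors)
    *(.dtors)
    *(.data*)
  } > ram AT > rom
}
```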
  • RossH Posts: 5,405
    edited 2010-11-16 21:07
    Congratulations David!

    I notice your number for fibo(26) looks a little odd. Presumably this has wrapped around?

    Ross.
  • Bill Henning Posts: 6,445
    edited 2010-11-16 21:35
    Excellent work David!

    PropCade can use SPI Flash too :)
    David Betz wrote: »
    I finally got code running from the SPI flash on the C3! This was after lots of mucking around with the linker script. Here are the fibo results. Speed isn't very impressive but I can now run up to 1MB of code with 64K of data on a C3! Even though the SD card gets mounted at the start of this run, it isn't actually needed. The code can run from previously programmed flash. Another nice thing about this is that I now have enough code space for an interactive BASIC bytecode compiler and VM, and probably also an editor if I can get one written.
    Starting SD driver...0000FFFF
    Mounting SD...00000000
    Opening ZPU image 'fibo.zbn'...00000000
    Reading image...16904 bytes
    start:      00100020
    data_image: 001039F0
    data_start: 00000000
    data_end:   00000818
    bss_start:  00000818
    bss_end:    00000830
    fibo(00) = 000000 (00000ms)
    fibo(01) = 000001 (00000ms)
    fibo(02) = 000001 (00000ms)
    fibo(03) = 000002 (00000ms)
    fibo(04) = 000003 (00001ms)
    fibo(05) = 000005 (00002ms)
    fibo(06) = 000008 (00003ms)
    fibo(07) = 000013 (00005ms)
    fibo(08) = 000021 (00009ms)
    fibo(09) = 000034 (00015ms)
    fibo(10) = 000055 (00024ms)
    fibo(11) = 000089 (00039ms)
    fibo(12) = 000144 (00064ms)
    fibo(13) = 000233 (00104ms)
    fibo(14) = 000377 (00168ms)
    fibo(15) = 000610 (00273ms)
    fibo(16) = 000987 (00443ms)
    fibo(17) = 001597 (00721ms)
    fibo(18) = 002584 (01170ms)
    fibo(19) = 004181 (01891ms)
    fibo(20) = 006765 (03050ms)
    fibo(21) = 010946 (04917ms)
    fibo(22) = 017711 (07931ms)
    fibo(23) = 028657 (12808ms)
    fibo(24) = 046368 (20709ms)
    fibo(25) = 075025 (33535ms)
    fibo(26) = 121393 (13427ms)
    
    pc       op sp       tos      reason
    0010351B 00 0000FFB8 001033F3 BREAKPOINT
    
  • David Betz Posts: 14,511
    edited 2010-11-16 22:19
    RossH wrote: »
    Congratulations David!

    I notice your number for fibo(26) looks a little odd. Presumably this has wrapped around?

    Ross.

    You mean the time for fibo(26)? Yes, this memory is so slow that the time wraps around. I'm using a direct-mapped cache and hooking into ZOG using the same protocol as jazzed uses with his SdramCache.spin driver. What is needed to get Catalina C working with another memory driver? I guess all of your memory access code is inline with your LMM runtime code, right? Have you tried interfacing with SdramCache or VMCOG or some other extended memory driver like that?
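    For context, a free-running 32-bit counter at 80 MHz wraps every 2^32 / 80,000,000 ≈ 53.7 s, so any single interval longer than that loses whole periods, which is presumably what happened to the fibo(26) time. A small C sketch of the modulo arithmetic (illustrative, not the actual Zog timing code):

```c
#include <stdint.h>

/* Elapsed ticks between two samples of a free-running 32-bit counter.
   Unsigned subtraction is correct across at most one wrap of the
   counter; intervals longer than one full period (about 53.7 s at
   80 MHz) are irrecoverably ambiguous. */
static uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return end - start;  /* modulo-2^32 arithmetic handles the wrap */
}
```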
  • RossH Posts: 5,405
    edited 2010-11-17 00:29
    David Betz wrote: »
    What is needed to get Catalina C working with another memory driver? I guess all of your memory access code is inline with your LMM runtime code, right? Have you tried interfacing with SdramCache or VMCOG or some other extended memory driver like that?

    Right - anything else and you lose too much speed. I started doing some design work on a caching SPI driver optimized for the access pattern of a typical Catalina program which would execute in another cog - I was hoping clever caching would help compensate for the slow SPI access time. However, I didn't think I would be able to finish it in time for the C3 release, so I stopped. As it turns out, I probably would have had time, but c'est la vie - I've spent the time doing other things instead.

    Now, it might be simpler just to interface with VMCOG and live with the performance hit. This is also on my "to do" list, but I'd be perfectly happy for someone else to do it.

    Ross.
  • Heater. Posts: 21,230
    edited 2010-11-17 01:39
    David,

    Well done. That's fantastic.
    I had forgotten to place the .ctors and .dtors

    Much head slapping going on here. It occurred to me to suggest looking for sections missing from the linker script with constructors/destructors in mind. Then, well, it went out of my mind.
    Speed isn't very impressive but

    Actually I think it is. Just looking around for fibo(20) results:

    ROM on C3 - 3050ms
    Jazzed's 32MB RAM - 2376ms
    Catalina on a RAMBLADE - 921ms

    Not too shabby, and anyway fibo is a really bad test for this, as discussed many times before.
    I can now run up to 1MB of code with 64K of data on a C3!

    Now here I'd missed a point about your C3 memory map. Am I correct in understanding that you are running fibo with data and stack in an external RAM? If so the speed is even more impressive. Looks like the C3 is another board I need around here.

    In your linker script you had a memory map like so:
      ram : ORIGIN = 0x00000000, LENGTH = 64K
      rom : ORIGIN = 0x00100000, LENGTH = 1M
      hub : ORIGIN = 0x10000000, LENGTH = 64K
      cog : ORIGIN = 0x10008000, LENGTH = 2K
      io  : ORIGIN = 0x10008800, LENGTH = 1K
    

    Did you fix up Zog to handle all those different areas?

    I'm starting to think we are going to need a family of different Zogs to handle all the different memory schemes.
  • David Betz Posts: 14,511
    edited 2010-11-17 05:10
    Heater. wrote: »
    In your linker script you had a memory map like so:
      ram : ORIGIN = 0x00000000, LENGTH = 64K
      rom : ORIGIN = 0x00100000, LENGTH = 1M
      hub : ORIGIN = 0x10000000, LENGTH = 64K
      cog : ORIGIN = 0x10008000, LENGTH = 2K
      io  : ORIGIN = 0x10008800, LENGTH = 1K
    

    Did you fix up Zog to handle all those different areas?
    No, ZOG still thinks that everything starting at zero up until the hub memory at 0x10000000 is RAM managed by the cache driver. I determine internal to the cache driver that addresses starting at 0x00100000 should be mapped to the flash. Actually, this is just an extension of what I already had to do to map different address ranges to the two SRAM chips. The cache code doesn't handle the flash any differently than another big SRAM chip except that it ignores writes to that space.
    I'm starting to think we are going to need a family of different Zogs to handle all the different memory schemes.
    I set up that memory map because I thought it matched what ZOG was already doing. I'm happy to change my map to follow whatever standard you define.

    I'm hoping to tackle modifying the syscall code so I can have the stdio functions open, close, read, write, and lseek call the dosfs.c code directly rather than sending messages back to the SPIN COG. In fact, I'm hoping to start using your run_zog.spin program instead of debug_zog.spin so I can take over all of the COGs. How did you handle the COGINIT instruction to allow ZOG programs to start other COGs without having to go back to SPIN code?
  • jazzed Posts: 11,803
    edited 2010-11-17 07:08
    Heater. wrote: »
    Just looking around for fibo(20) results:

    ROM on C3 - 3050ms
    Jazzed's 32MB RAM - 2376ms
    Catalina on a RAMBLADE - 921ms
    My latest SDRAM cache fibo(20) calculation time is around 2060ms.

    FIBO is a bad test if calculation time is the only metric considered. Overall test time is just as important, and I suspect it is dismal in all cases because of the memory distribution of iprintf and other functions. I'll look at the dhrystone timing again later.
  • David Betz Posts: 14,511
    edited 2010-11-17 07:11
    jazzed wrote: »
    My latest SDRAM cache fibo(20) calculation time is around 2060ms.

    FIBO is a bad test if calculation time is the only metric considered. Overall test time is just as important, and I suspect it is dismal in all cases because of the memory distribution of iprintf and other functions. I'll look at the dhrystone timing again later.
    It could be that my fibo numbers are wrong anyway because I've never modified the source to match my clock speed. Isn't the default setup for 100mhz operation? I'm running my C3 at the standard 80mhz clock rate.
  • jazzed Posts: 11,803
    edited 2010-11-17 07:51
    David Betz wrote: »
    It could be that my fibo numbers are wrong anyway because I've never modified the source to match my clock speed. Isn't the default setup for 100mhz operation? I'm running my C3 at the standard 80mhz clock rate.
    Yes, good observation. I never changed the fibo clock constants either until now.

    With the adjustment the sum of the fibo calculation times is very close to the total run time. The calculated sum is 126943ms, or about 2m7s, versus an actual 2m13s.

    I assume the missing 6 seconds is accounted for by startup time, program printing, and cache swap overhead.

    Cache swap overhead is a big problem with larger buffers. No doubt it is also an issue with VMCOG.
    #define XTAL_FREQUENCY  5000000
    #define PLL16X          16
    #define CLOCK_FREQUENCY (XTAL_FREQUENCY * PLL16X)
    
    Starting SD driver...0000FFFF
    Mounting SD...00000000
    Opening ZPU image 'fibo.bin'...00000000
    Reading image...17056 bytes
    Clearing bss: ....
    fibo(00) = 000000 (00000ms)
    fibo(01) = 000001 (00000ms)
    fibo(02) = 000001 (00000ms)
    fibo(03) = 000002 (00000ms)
    fibo(04) = 000003 (00001ms)
    fibo(05) = 000005 (00001ms)
    fibo(06) = 000008 (00003ms)
    fibo(07) = 000013 (00005ms)
    fibo(08) = 000021 (00008ms)
    fibo(09) = 000034 (00013ms)
    fibo(10) = 000055 (00021ms)
    fibo(11) = 000089 (00035ms)
    fibo(12) = 000144 (00057ms)
    fibo(13) = 000233 (00093ms)
    fibo(14) = 000377 (00150ms)
    fibo(15) = 000610 (00243ms)
    fibo(16) = 000987 (00394ms)
    fibo(17) = 001597 (00637ms)
    fibo(18) = 002584 (01032ms)
    fibo(19) = 004181 (01670ms)
    fibo(20) = 006765 (02702ms)
    fibo(21) = 010946 (04372ms)
    fibo(22) = 017711 (07075ms)
    fibo(23) = 028657 (11447ms)
    fibo(24) = 046368 (18522ms)
    fibo(25) = 075025 (29970ms)
    fibo(26) = 121393 (48492ms)
    
    pc       op sp       tos      reason
    #----------
    000034D1 00 01FFFFB8 00003822 
    real	2m13.230s
    user	0m0.000s
    sys	0m0.004s
    
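    Those clock constants drive the millisecond conversion; a minimal sketch, assuming the 80 MHz (5 MHz crystal, 16x PLL) configuration shown:

```c
#include <stdint.h>

/* Convert a tick delta to milliseconds using the constants quoted
   above. With a 5 MHz crystal and a 16x PLL the system clock is
   80 MHz, i.e. 80,000 ticks per millisecond. Dividing the frequency
   first keeps the intermediate from overflowing 32 bits. */
#define XTAL_FREQUENCY  5000000
#define PLL16X          16
#define CLOCK_FREQUENCY (XTAL_FREQUENCY * PLL16X)

static uint32_t ticks_to_ms(uint32_t ticks)
{
    return ticks / (CLOCK_FREQUENCY / 1000);
}
```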
  • Heater. Posts: 21,230
    edited 2010-11-18 04:36
    David,

    I guess it makes sense that any division of the memory map between different external devices is handled by the VMCOG/CACHE driver COG rather than Zog itself.

    I have no plans to change the memory map at this moment.
    I'm hoping to tackle modifying the syscall code so I can have the stdio functions open, close, read, write, and lseek call the dosfs.c code directly rather than sending messages back to the SPIN COG

    I presume you mean the syscalls.c module in libgloss not the syscall handler in Zog.

    So we end up with: Application => libgloss => dosfs => SD block driver.

    What to do about that lowest level, the SD block driver?

    Ideally that would be a stand alone PASM SD driver that could be used from C, dosfs in this case, through a HUB memory mailbox/buffer interface.
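    Such a mailbox is just a small shared structure in hub RAM; a hypothetical C-side sketch of what that interface might look like (field names and command codes are invented for illustration, a real driver would define its own):

```c
#include <stdint.h>

/* Hypothetical hub-RAM mailbox for a standalone PASM SD block
   driver. The C side posts a request; the driver cog clears 'cmd'
   when the transfer is done. Invented for illustration. */
typedef struct {
    volatile uint32_t cmd;     /* 0 = idle/done, 1 = read block, 2 = write block */
    volatile uint32_t block;   /* 512-byte block number on the card */
    volatile uint32_t buffer;  /* hub address of the 512-byte data buffer */
    volatile uint32_t status;  /* driver result code, 0 = OK */
} sd_mailbox_t;

static void sd_post_read(sd_mailbox_t *mb, uint32_t block, uint32_t buf_addr)
{
    mb->block  = block;
    mb->buffer = buf_addr;
    mb->status = 0;
    mb->cmd    = 1;            /* written last: this starts the driver */
}

/* The caller polls this until it returns nonzero, then checks status. */
static int sd_done(const sd_mailbox_t *mb)
{
    return mb->cmd == 0;
}
```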
    I'm hoping to start using your run_zog.spin program instead of debug_zog.spin so I can take over all of the COGs. How did you handle the COGINIT instruction to allow ZOG programs to start other COGs without having to go back to SPIN code?

    Ah. This is a bit complicated...

    If you look in the lib directory in zog you will find propeller.c. In there, a few Spin-like functions have been defined. Significant here is cognew(), which loads and starts a COG from a binary blob.

    You will find FullDuplexSerialPlus.cpp and VMCog.cpp which are C++ implementations of the respective Spin objects. They use cognew() to get their PASM blobs started.

    All this is built into a library, libzog.a, and a test program, "test_libzog". It is that test I have had running under run_zog, using all of HUB RAM, or with the option of continuing to run some Spin code.

    Where do the FDX and VMCOG PASM binary blobs come from?

    They are compiled and extracted from the respective Spin modules using BSTC.

    Then they are corrected for endianness with objcopy, and then converted to ELF object files, again with objcopy. These objects are then linked into the test_libzog program, from which they can be loaded via cognew(). Have a look in the Makefile to see how it all hangs together.

    Phew...

    This is all by way of an experimental hack but it does show what can be done. This works from HUB RAM and needs a bit of effort to be able to use it from EXT memory.

    I would like to think that an SD block driver could be added to this scheme. The float32 cog is working this way already.

    Now, what would be better is to not have the PASM blobs taking up space in the ZPU HUB space.

    My simple approach would be:

    a) run_zog loads and starts cogs for FullDuplexSerial, VMCog/CACHE_RAM, SD block, Float32, video, keyboard: whatever a particular app needs. These devices have memory interfaces/buffers defined somewhere in high HUB RAM.

    b) run_zog then starts a Zog interpreter on the application's C code, moving it to address zero first.

    c) The C code runs and uses the available hardware drivers through the high HUB memory interfaces.

    d) The C code would be able to start a Zog running ZPU code in external memory.

    Perhaps you have a better plan.
  • David Betz Posts: 14,511
    edited 2010-11-18 05:56
    Heater. wrote: »
    I presume you mean the syscalls.c module in libgloss not the syscall handler in Zog.

    So we end up with: Application => libgloss => dosfs => SD block driver.

    What to do about that lowest level, the SD block driver?

    Ideally that would be a stand alone PASM SD driver that could be used from C, dosfs in this case, through a HUB memory mailbox/buffer interface.
    Yes, that's what I'd like to get to. At the moment my application calls dosfs directly and then hijacks the read/write syscalls to interface with the fsrw driver through your SPIN syscalls handler to do raw sector I/O. Unfortunately, this means I can't use the normal open/close/read/write/lseek functions in my C code to handle file I/O. I have to call dosfs directly. I will probably attempt to access the mailbox interface to my SPI driver directly from C code. Before I do that I have to add SD support to go along with my SPI flash and SPI SRAM support.
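    Much of what such a syscall layer needs is descriptor bookkeeping; a hypothetical sketch of the fd table (in the real port each slot would presumably hold a dosfs FILEINFO rather than a bare offset):

```c
#include <stdint.h>

/* Hypothetical descriptor table mapping small POSIX-style fds onto
   per-file state. Sketches only the bookkeeping a libgloss-style
   open/close layer needs, not the actual FAT I/O. */
#define MAX_FILES 8

typedef struct {
    int      in_use;
    uint32_t pos;      /* current file offset, advanced by read/write */
} file_slot_t;

static file_slot_t slots[MAX_FILES];

/* Return the lowest free fd, reserving 0..2 for stdio, or -1. */
static int fd_alloc(void)
{
    for (int fd = 3; fd < MAX_FILES; fd++)
        if (!slots[fd].in_use) {
            slots[fd].in_use = 1;
            slots[fd].pos = 0;
            return fd;
        }
    return -1;
}

static int fd_free(int fd)
{
    if (fd < 3 || fd >= MAX_FILES || !slots[fd].in_use)
        return -1;
    slots[fd].in_use = 0;
    return 0;
}
```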
    Ah. This is a bit complicated...
    Thanks for the overview of how things work today. I may just stick with debug_zog.spin for the moment since my first priority is to get my basic system running on the C3.
    Perhaps you have a better plan.
    No, I'm just fumbling along at the moment, trying to achieve the short-term goal of getting stuff running on the C3. I'll let you do all of the deep thinking! :-)
  • David Betz Posts: 14,511
    edited 2010-11-18 18:39
    Bad news...

    Now that I have GCC/ZOG running on the C3 using the SPI flash for code and the SPI SRAM for data, I was finally able to compile my simple BASIC bytecode compiler and interpreter to run on the C3. The good news is that it works fine. The bad news is that it is unbearably slow. It isn't really usable. I'm not sure if this is because of the bad performance of my cache code or if it is just that running interpreted C code from external RAM doesn't have good enough performance for my application. I'm not sure what I'll try next...
  • Heater. Posts: 21,230
    edited 2010-11-19 00:58
    That's a shame.

    Which part is unusable: the compilation-to-bytecode stage, the actual running of the BASIC byte code, or both?

    When the actual BASIC bytecode program is running I guess we have:
    1) An interpreter in PASM executing ZPU byte codes.
    2) Those byte codes are an interpreter in C executing your BASIC byte codes.
    3) Those byte codes are the actual BASIC program.

    We might expect this to be rather sluggish:)

    One obvious way out is to have the BASIC compiler generate ZPU byte codes rather than your current BASIC byte codes, as you have mentioned before. This removes a whole layer of interpretation and hence a speed up by a huge factor.

    Before going down that road it might be worth guesstimating what speed-up is achievable. For example, one would imagine what ZPU byte code sequence would be generated for typical BASIC constructs like assignments, for/next loops, etc., and compare that with the number of ZPU instructions currently executed for the same constructs. I'm sure you have thought about all this anyway.

    As for the performance of your cache code, I have no idea. Does it have multiple pages in HUB at a time and a replacement policy like VMCOG?

    It would be interesting to know how the speed using SPI RAM / ROM compares to Jazzed's 32MB RAM CACHE solution or the VMCOG solution.
  • David Betz Posts: 14,511
    edited 2010-11-19 04:20
    Heater. wrote: »
    That's a shame.

    Which part is unusable: the compilation-to-bytecode stage, the actual running of the BASIC byte code, or both?

    When the actual BASIC bytecode program is running I guess we have:
    1) An interpreter in PASM executing ZPU byte codes.
    2) Those byte codes are an interpreter in C executing your BASIC byte codes.
    3) Those byte codes are the actual BASIC program.

    We might expect this to be rather sluggish:)
    Yup. :-(
    One obvious way out is to have the BASIC compiler generate ZPU byte codes rather than your current BASIC byte codes, as you have mentioned before. This removes a whole layer of interpretation and hence a speed up by a huge factor.
    This had been my plan but even the compiler is far too slow. Speeding up the execution of the compiled BASIC program won't help that.
    Before going down that road it might be worth guesstimating what speed-up is achievable. For example, one would imagine what ZPU byte code sequence would be generated for typical BASIC constructs like assignments, for/next loops, etc., and compare that with the number of ZPU instructions currently executed for the same constructs. I'm sure you have thought about all this anyway.
    Yes, doing that analysis before jumping into generating ZPU code would make sense. Also, I could target the SPIN VM. Unfortunately, neither of those options address the problem that the bytecode compiler itself is too slow to be usable.
    As for the performance of your cache code, I have no idea. Does it have multiple pages in HUB at a time and a replacement policy like VMCOG?

    It would be interesting to know how the speed using SPI RAM / ROM compares to Jazzed's 32MB RAM CACHE solution or the VMCOG solution.
    I can certainly do some more optimization of my cache code. It uses a direct-mapped cache with 32-byte cache lines. Making it multi-way might improve things a lot. I just don't have a lot of confidence that it will be enough to make the compiler usable. I may do it anyway just because it will make GCC/ZOG more usable on the C3 even if my on-board BASIC development environment isn't. I have been having fun with the Propeller and I don't want to give it up.
  • Heater. Posts: 21,230
    edited 2010-11-19 05:12
    David,
    This had been my plan but even the compiler is far too slow...

    Ah.
    I could target the SPIN VM...

    That's blasphemy round these parts:)

    Of course you could also target an LMM kernel. Design one of your own or borrow one from Catalina or such.

    Actually, we have never discussed your BASIC's byte codes. For example, if they were few enough and simple enough, could you create an interpreter for them in PASM? Along the lines of Zog but different. Who said the BASIC had to be run by the same interpreter as the compiler? That would remove a layer of run-time interpretation at the cost of having to maintain yet another virtual machine.
    ...address the problem that the bytecode compiler itself is too slow...

    Hmmm...where does the time go?

    Now I'm not really familiar with the world of caches and virtual memory so you might need to explain more about what goes on there. "direct-mapped", "multi-way" etc.

    If we look at ZPU execution we have something like this for every instruction:

    1) A byte code fetch.
    2) One or two accesses to the stack.
    3) Perhaps an access to the data area.

    So it would seem that as a minimum it would be beneficial to have some part of each of the code, data and stack areas sitting in on-chip RAM as a "working set". If we have to reload the cache every time we go from code to data to stack that is an awful lot of thrashing.

    Of course you could be using Catalina or ICC instead but without getting the memory system up to speed there probably aren't many gains to be had.

    I would be interested in how the compiler performs under VMCOG or SDRAMCACHE. I guess VMCOG does not support sufficient space yet.
  • Heater. Posts: 21,230
    edited 2010-11-19 05:49
    OK, I just did my first investigations of "direct-mapped cache" here http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Memory/direct.html

    I guess yours looks something like that: 128 slots of 32 bytes each, 4096 bytes in on-chip RAM at any moment.

    Given that you are decoding RAM/ROM space in your memory driver I was wondering if there is any advantage to be gained from having separate caches for code and data. At least that way there would be no cache collisions between stack/data and opcode fetches.
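    With those numbers the slot and tag computations are simple shifts and masks; a hedged C sketch (geometry from the thread, code invented for illustration, not the actual PASM cache driver):

```c
#include <stdint.h>

/* Direct-mapped cache geometry matching the numbers above:
   32-byte lines, 128 slots, 4 KB of hub RAM in total. */
#define LINE_SHIFT   5                      /* log2(32-byte line)  */
#define NUM_LINES    128
#define LINE_MASK    (NUM_LINES - 1)

static uint32_t cache_index(uint32_t addr)  /* which slot */
{
    return (addr >> LINE_SHIFT) & LINE_MASK;
}

static uint32_t cache_tag(uint32_t addr)    /* identifies the cached line */
{
    return addr >> (LINE_SHIFT + 7);        /* 7 = log2(NUM_LINES) */
}
```

Note that any two addresses 4 KB apart land in the same slot, which is exactly how opcode fetches and stack/data accesses can thrash each other in a unified direct-mapped cache.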
  • David Betz Posts: 14,511
    edited 2010-11-19 06:38
    I could certainly have separate code and data caches, but I wonder if that would be better than just having a multi-way unified cache. Also, I guess I could try using more than 4K of hub RAM as a cache. I will probably play with some of these ideas to see if they help.

    Just for a point of reference, what do you consider the fastest external RAM version of ZOG that is available at the moment? What platforms can it run on? I'd try my VMCOG driver but it is currently limited to only 64k, and my C code compiles to almost that amount of code, never mind data. If I can figure out how to get VMCOG to address both the 64k of SRAM and the 1MB of flash I could try that to see if it's any better than my cache driver.
  • Bill Henning Posts: 6,445
    edited 2010-11-19 06:55
    David, I am sorry to hear the compiler runs too slowly.

    Unfortunately VMCOG does not support >64K VM's right now, but it will as soon as I have some time to work on it.

    If your program would run in 128K, you could run two copies of VMCOG (in a separate cog each) with different mailbox addresses, and use one for code and one for data. You could add the additional SPI RAMs to the C3's SPI expansion ports.

    If not, maybe you can profile your compiler, and see if you can speed the compiler up.

    Where is it spending most of its time?

    Is it a recursive descent compiler?

    Regards,

    Bill

    p.s.

    One of my upcoming boards will support 8 SPI RAMs, so it needs a large VMCOG for proper operation :)
  • Heater. Posts: 21,230
    edited 2010-11-19 06:56
    Quickly runs away to read up on "multi-way unified cache....".

    As you see I'm in no position to advise. Where is Bill when you need him?

    Increasing the size of the cache must surely help, as a brute force approach.
    ...what do you consider the fastest external RAM version of ZOG...

    I don't know yet. I have here a TriBlade and VMCOG which means nothing in your case due to the 64K limit. I also have a 32MB GadgetGangster card which due to work/flu/more work I have yet to find time to get running. Most of the timing experiments we have done were using fibo which is somewhat useless for this task.
    What platforms can it run on?

    Anywhere that is supported by VMCOG, the GadgetGangster 32MB setup, and now your C3 effort.

    I know Bill has plans to greatly increase the address space handled by VMCOG, no idea how far along that has come.

    I'm somewhat tempted to do what I always said I would not do, create a Zog with direct hardware access to the 512KBytes on the TriBlade card. Perhaps another for the DracBlade. This is the way to go for raw speed on those cards.

    None of this will happen until I have the GadgetGangster running here, and floating point and...