Zog - A ZPU processor core for the Prop + GNU C, C++ and FORTRAN.Now replaces S

Bill Henning · 2010-11-10 19:19

Hi Steve!

jazzed wrote: »

Howdy! I'm happy to report the results of some efforts this week

<snip>

I had to use old numbers since people refuse to post anything new for some reason.
ZOG running from HUB could finish FIBO(26) in about 41 seconds according to an old Heater post.

Latest results are about 1/3rd of the way down on page 21 of the VMCOG thread at

http://forums.parallax.com/showthread.php?t=119725&page=21

Unfortunately the version I have wraps the time around, so I'll just compare fibo 20 and fibo 25 here

Fibo test platform:     ZOG HUB*1       ZOG SdramCache*2    ZOG VMCOG*3
(System Clock)          80 MHz          80 MHz              80 MHz

fibo(20)                684ms           2060ms              2782ms
fibo(25)                7594ms          22855ms             28507ms

SdramCache is approx. 20% faster than VMCOG at 80 MHz

Well done Steve!

jazzed wrote: »

The VMCOG performance I found was for a 100 MHz Triblade: fibo(20) at 2782ms.

Conclusions:

As far as I can tell without more up to date information ...

SdramCache is 25% faster than VMCOG for fibo(20) on a 20% slower Propeller

SdramCache is 3x slower than HUB on the same machine.

The equivalent SDRAM performance with VMCOG as the front end is impossible to achieve because 1) VMCOG's HUB interface is slower, and 2) there is not enough room for the 32 byte 10MB/s SDRAM read/write code.

Standardizing on VMCOG for external memory interface would be a mistake even if VMCOG could support 32MB+ of memory.

I am afraid I have to disagree with your conclusions - mind you, your conclusions would have been different if you had spotted the VMCOG benchmark results on page 21 of my thread, and if you kept the design goals of VMCOG in mind.

Re/ 1:

VMCOG's HUB interface is definitely slower - and it has to be, due to the overriding design goal of making a generic memory interface that is VERY SIMPLE to interface to - it was NOT designed to be the optimal memory interface!

All along the idea was that it would take as little code as possible, and generic code at that, to interface to *ANY* memory hooked up to the prop, and make it transparent to the application (ZOG, or any other virtual machine).

It was designed to be a way of EASILY supporting every memory interface for virtual machines and software; with the idea being that optimized access to memory could be added later, if needed - so that the maintainer of a virtual machine did not have to maintain many different memory interface code sets; and that the access code used little cog memory.

Re/ 2:

I can fit a 32 long access routine in, no sweat - especially in the upcoming big VMCOG, which moves the page present TLB into the hub freeing much cog space at the cost of maybe a 2% performance hit.

Re/ 3: "Standardizing on VMCOG for external memory interface would be a mistake even if VMCOG could support 32MB+ of memory."

I totally disagree - see #1 above for the explanation why. I don't think virtual machine writers should have to support umpteen different memory hardware interfaces, nor do I see why memory hardware providers should have to make special drivers for umpteen different VM's.

Software written against the VMCOG interface should run unchanged *regardless of the underlying memory hardware*.

Sure, it will cause a 20%-25% performance hit compared to your SDCache for your SDRAM module... but:

1) Will all VM's have room for your 32-long code + cache tags + etc - the full footprint of your SDRAM Cache driver?

2) How will the performance compare when large programs are run, without as much locality of reference?

3) Will you write and maintain a cache interface for every other memory interface out there?

From the beginning, I proposed VMCOG as a nice hardware abstraction layer between many different memory expansion gadgets, and many different VM's that can use more memory - with the greatest compatibility and least effort. I never said it was the fastest solution.

A cache-line solution, where the VM/app has direct access to the cache line, IS a better solution for raw throughput - but requires more code in the VM/app, and unique support in the VM/app (or cache manager) for each memory interface (or at least each cache line size).

For the Propeller, for VM's and apps, I think we are better served by an easier to implement standard, without having to worry about manipulating cache lines directly, or worrying about cache line lengths etc in VM's and apps - and later implement direct interfaces where needed.

Regarding supporting more memory - I am committed to making a BIG version of VMCOG, however getting the rest of my boards into production with my new board house has priority. I should be back on VMCOG by the end of the month.

Currently, I am planning to support at least 8MB, however the initial big VMCOG will only support up to 2MB to limit the size of the hub based TLB. To support more than 2MB, I will have to add associativity, which will slow it down more - and associativity factors >2 will cause a HUGE slowdown according to my calculations.

jazzed wrote: »

Integrating with Catalina should yield interesting results.

I eagerly await the Catalina results!

jazzed · 2010-11-10 21:44

Gee Bill. Over-reacting just a smidgen?

VMCOG is a fine generic interface to support many platforms even if I do question some of the "ease of use." I'm not suggesting replacing VMCOG with what I've done for all hardware, but anyone who plays this game has the "chops" to do it.

I'm making it perfectly clear by comparisons that leaving out the SdramCache solution (and other good fits) would be a mistake. That's all.

As you have already pointed out in so many words, VMCOG is like vanilla ice cream: simple, no frills, not optimized for a particular taste.

When you have something that lets SDRAM (or DDR, when that time comes) shine for what it's worth, I'll happily adopt it and so will others. Meanwhile I won't hold my breath for that day. I'll go on to produce Rocky Road, Coconut Sorbet, and others.

Cheers.
--Steve

Heater. · 2010-11-10 23:09

Just now I'm stricken with a terrible case of flu and my fevered mind can only just follow what you are up to.

It all sounds very good and I can only make one comment:

There is no point in drawing any conclusions about virtual memory/cache/whatever performance on the basis of the execution time of FIBO. It's much too small. It only needs a weeny bit each of stack, data and code memory to be in place during its work.

OK, two comments:

As we have 32MB bolted onto the Prop we damn well want to be able to get at all of it:)

RossH · 2010-11-10 23:44

Heater. wrote: »

As we have 32MB bolted onto the Prop we damn well want to be able to get at all of it:)

640 K ought to be enough for anybody.

David Betz · 2010-11-11 05:07

I've been working on getting my C3 version of jazzed's SdramCache interface working and I think I found a bug in ZOG itself. In the functions cache_read and cache_write, the READ_CMD or WRITE_CMD bits are ORed with the address and stored in mboxcmd. This works well with word or long aligned addresses but doesn't work correctly for odd byte addresses. In particular, the WRITE_CMD constant has a zero in the low order bit and if you OR it with an odd memory address you turn it into a read command. I've added an ANDN instruction just before the OR to eliminate this problem and this seems to make my test program work correctly. Let me know if you see any problems with this solution.

cache_write             ' if cacheaddr <> addr, load new cache line
                        mov     memp, addr                  'save address for index
                        andn    addr, #1   ' <---- added by dbetz
                        or      addr, #cache#WRITE_CMD
                        call    #cache_access
cache_write_ret         ret

cache_read              ' if cacheaddr <> addr, load new cache line
                        mov     memp, addr                  'save address for index
                        andn    addr, #1   ' <---- added by dbetz
                        or      addr, #cache#READ_CMD
cache_access
                        wrlong  addr, mboxcmd
                        mov     cacheaddr,addr              'Save new cache address. it's free time here
                        andn    cacheaddr,#cache#LINELEN-1  'Kill command bits in free time
:waitres                rdlong  temp, mboxcmd
                        tjnz    temp, #:waitres             'We have room for tjnz 4 or 8 cycles
                        rdlong  mboxptr, mboxdat            'Get new buffer
                        and     memp, #(cache#LINELEN-1)    'memp is index into buffer
                        add     memp, mboxptr               'memp is now HUB buf address of data to read
cache_read_ret
cache_access_ret        ret

It might actually be best to ANDN with 3 instead of just 1, since the command is actually two bits wide. However, the high bit of the command is always set, so it doesn't really matter whether we mask the corresponding address bit. The best bet would probably be to do something like this:

' in the cache code:
CON
  CMD_MASK = 3

' in ZOG:
  andn addr, #cache#CMD_MASK

In any case, this makes my basic VM work correctly under ZOG with my cache code. Now I can start working on adding access to the SPI flash!
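
The failure mode described above can be reproduced in a few lines of C. This is only a sketch: the values WRITE_CMD = 2 and READ_CMD = 3 are assumptions inferred from the description (two command bits, high bit always set, low bit clear for a write), and make_cmd_buggy, make_cmd_masked and decode_cmd are hypothetical names, not functions from ZOG or the cache driver.

```c
#include <stdint.h>

/* Assumed values, inferred from the description above; not copied from the cache driver. */
enum { CMD_MASK = 3, WRITE_CMD = 2, READ_CMD = 3 };

/* Original behaviour: OR the command straight into the address. */
static uint32_t make_cmd_buggy(uint32_t addr, uint32_t cmd)
{
    return addr | cmd;
}

/* Fixed behaviour: clear the command bits first (the ANDN), then OR. */
static uint32_t make_cmd_masked(uint32_t addr, uint32_t cmd)
{
    return (addr & ~(uint32_t)CMD_MASK) | cmd;
}

/* The command as the receiving side would extract it from the mailbox value. */
static uint32_t decode_cmd(uint32_t mbox)
{
    return mbox & CMD_MASK;
}
```

With these assumed values, a write to the odd address 0x1001 decodes as a read in the buggy version, while the masked version yields 0x1002 and still decodes as a write. The byte offset is not lost because memp keeps the original address.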

Heater. · 2010-11-11 07:47

Jazzed,

A few thoughts from my flu addled mind...

I do believe HUB only support should be available some time, some way.

HUB only support is available already. In the Zog package there is the run_zog.spin program which can:

1) Load up the ZPU interpreter allocating all of HUB to ZPU code apart from the dispatch table which is moved to the top of HUB. The ZPU code to be run is moved down to address zero and the stack put just under the dispatch table. All spin code is obliterated.

2) Alternatively load the ZPU interpreter and run the ZPU binary from wherever it is in the HUB. Then continue to run some Spin code.

There is a CUCKOO_MODE define to select one of these options.

Option 1) basically gives you a running ZPU interpreter instead of a running Spin interpreter for the main program. Any COG devices loaded and run after that are up to the C application in the same way as they are in any Spin program. As a start the included demo C program starts up a FullDuplexSerial object and a VMCog. This is all somewhat experimental so far. It does lead to the possibility of rewriting debug_zog in C:)

Does ZOG/ZPU not have an ioctl syscall?

Strangely enough the ZPU libs do not define an ioctl system call. No reason we could not add one though. ioctl is of course a general purpose hook that any device driver can use (or not) for any purpose. Its parameters and function are different for every device. How would you define it? How would we hook ioctl handlers up when Spin has no idea of function pointers?

Let me know the approach you plan to use for associating file/device descriptor IDs with devices. I'll hard-code something for the time being. Console uses 0,1,2 ... I'll try using 3 or something for screen block driver.

Interesting. Traditionally file handles/descriptors are allocated to open files, which may or may not actually be some other device and not an actual file. Handles 0, 1 and 2 are reserved for stdin, stdout, stderr whether they are real files or devices. Actual mapping of file names to devices, like the console port, was done through major and minor numbers which were hard-coded into the /dev directory entries.

In our case we just have to be sure not to clash with real file handles.

Where is this separate function for read BYTE and how do I use it from C? Read/modify/write using long is probably sufficient for byte only data elements.

I did not explain that very well. Currently in the interpreter there are routines to read and write LONGs. Only these routines implement the HUB/EXT memory map as described. So access to WORDs and BYTEs and stack ops are not mapped.

In C you can write:

int x = *some_hub_long_ptr;
char y = *some_hub_byte_ptr;

In which case x will be read as a LONG and mapped correctly to HUB. The read of y will fail.

So, as I said, I suspect for video RAM in HUB, BYTE access will be required and the mapping should also be applied to read/write_byte in the interpreter.
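
The read/modify/write approach mentioned above, building byte access on top of the mapped LONG routines, could look roughly like this in C. It is a sketch under assumptions: read_long and write_long are hypothetical stand-ins backed by a small array (not the interpreter's real routines), and byte 0 is taken to be the most significant byte of a long (big-endian order).

```c
#include <stdint.h>

/* Hypothetical stand-in for the interpreter's mapped LONG accessors. */
static uint32_t mem[4];

static uint32_t read_long(uint32_t addr)          { return mem[(addr >> 2) & 3]; }
static void write_long(uint32_t addr, uint32_t v) { mem[(addr >> 2) & 3] = v; }

/* Assumes big-endian order within a long: byte 0 is the most significant. */
static unsigned shift_for(uint32_t addr)
{
    return (3u - (addr & 3u)) * 8u;
}

static uint8_t read_byte(uint32_t addr)
{
    return (uint8_t)(read_long(addr & ~3u) >> shift_for(addr));
}

/* Read the containing long, replace one byte lane, write the long back. */
static void write_byte(uint32_t addr, uint8_t b)
{
    uint32_t v = read_long(addr & ~3u);
    unsigned s = shift_for(addr);
    v = (v & ~((uint32_t)0xFF << s)) | ((uint32_t)b << s);
    write_long(addr & ~3u, v);
}
```

The cost is two long accesses per byte write, which is why applying the mapping directly in the interpreter's byte routines may still be preferable for video RAM.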

David Betz · 2010-11-11 08:06

Heater. wrote: »

Traditionally file handles/descriptors are allocated to open files, which may or may not actually be some other device and not an actual file. Handles 0, 1 and 2 are reserved for stdin, stdout, stderr whether they are real files or devices. Actual mapping of file names to devices, like the console port, was done through major and minor numbers which were hard-coded into the /dev directory entries.

In our case we just have to be sure not to clash with real file handles.

In my case, I wanted to move the filesystem handling into the C code since the alternatives currently available in Spin don't support multiple open files very well. So I hijacked file descriptor 3 (0, 1, and 2 are stdin/stdout/stderr) for raw SD sector I/O. Maybe I should have added another syscall for the raw I/O. It was just easier to use one of the existing syscalls. What I'd really like to do is move the sector I/O into the C code as well. I may work on that next.

jazzed · 2010-11-11 08:41

@Heater, I certainly appreciate your post and will respond with thoughts on using
ioctl separately (I'm using a special read to get TV_TEXT parameters for now).

@David, Here is an alternative for ZOG's cache access which saves some time:

cache_write             mov     memp, addr                  'save address for index
                        andn    addr, #cache#CMD_MASK       'ensure a write is not a read
                        or      addr, #cache#WRITE_CMD
                        jmp     #cache_access

cache_read              mov     memp, addr                  'save address for index
                        andn    addr, #cache#CMD_MASK       'ensure a read is not a write
                        or      addr, #cache#READ_CMD

cache_access            ' if cacheaddr <> addr, load new cache line
                        wrlong  addr, mboxcmd
                        mov     cacheaddr,addr              'Save new cache address. it's free time here
                        andn    cacheaddr,#cache#LINELEN-1  'Kill command bits in free time
:waitres                rdlong  temp, mboxcmd
                        tjnz    temp, #:waitres             'We have room for tjnz 4 or 8 cycles
                        rdlong  mboxptr, mboxdat            'Get new buffer
                        and     memp, #(cache#LINELEN-1)    'memp is index into buffer
                        add     memp, mboxptr               'memp is now HUB buf address of data to read
cache_read_ret
cache_write_ret
cache_access_ret        ret

If address 0 was not a valid address, we could eliminate the andn for CMD_MASK completely.

David Betz · 2010-11-11 08:44

jazzed wrote: »

@Heater, I certainly appreciate your post and will respond with thoughts on using
ioctl separately (I'm using a special read to get TV_TEXT parameters for now).

@David, Here is an alternative for ZOG's cache access which saves some time:

cache_write             mov     memp, addr                  'save address for index
                        andn    addr, #cache#CMD_MASK       'ensure a write is not a read
                        or      addr, #cache#WRITE_CMD
                        jmp     #cache_access

cache_read              mov     memp, addr                  'save address for index
                        andn    addr, #cache#CMD_MASK       'ensure a read is not a write
                        or      addr, #cache#READ_CMD

cache_access            ' if cacheaddr <> addr, load new cache line
                        wrlong  addr, mboxcmd
                        mov     cacheaddr,addr              'Save new cache address. it's free time here
                        andn    cacheaddr,#cache#LINELEN-1  'Kill command bits in free time
:waitres                rdlong  temp, mboxcmd
                        tjnz    temp, #:waitres             'We have room for tjnz 4 or 8 cycles
                        rdlong  mboxptr, mboxdat            'Get new buffer
                        and     memp, #(cache#LINELEN-1)    'memp is index into buffer
                        add     memp, mboxptr               'memp is now HUB buf address of data to read
cache_read_ret
cache_write_ret
cache_access_ret        ret

If address 0 was not a valid address, we could eliminate the andn for CMD_MASK completely.

You can also eliminate the andn if the XXX_CMD value you're ORing into the address has all of its bits set. It would probably be best to make READ_CMD=3 since I suspect reads are far more common than writes.

Bill Henning · 2010-11-11 08:49

I managed to shave one instruction out, but caveat emptor - this code is not tested!

cache_write             movs    cmdpoke, #cache#WRITE_CMD
                        jmp     #cache_access

cache_read              movs    cmdpoke, #cache#READ_CMD

cache_access            ' if cacheaddr <> addr, load new cache line
                        mov     memp, addr                  'save address for index
                        andn    addr, #cache#CMD_MASK       'ensure a write is not a read
cmdpoke                 or      addr, #0-0
                        wrlong  addr, mboxcmd
                        mov     cacheaddr,addr              'Save new cache address. it's free time here
                        andn    cacheaddr,#cache#LINELEN-1  'Kill command bits in free time
:waitres                rdlong  temp, mboxcmd
                        tjnz    temp, #:waitres             'We have room for tjnz 4 or 8 cycles
                        rdlong  mboxptr, mboxdat            'Get new buffer
                        and     memp, #(cache#LINELEN-1)    'memp is index into buffer
                        add     memp, mboxptr               'memp is now HUB buf address of data to read
cache_read_ret
cache_write_ret
cache_access_ret        ret

jazzed wrote: »

@Heater, I certainly appreciate your post and will respond with thoughts on using
ioctl separately (I'm using a special read to get TV_TEXT parameters for now).

@David, Here is an alternative for ZOG's cache access which saves some time:

cache_write             mov     memp, addr                  'save address for index
                        andn    addr, #cache#CMD_MASK       'ensure a write is not a read
                        or      addr, #cache#WRITE_CMD
                        jmp     #cache_access

cache_read              mov     memp, addr                  'save address for index
                        andn    addr, #cache#CMD_MASK       'ensure a read is not a write
                        or      addr, #cache#READ_CMD

cache_access            ' if cacheaddr <> addr, load new cache line
                        wrlong  addr, mboxcmd
                        mov     cacheaddr,addr              'Save new cache address. it's free time here
                        andn    cacheaddr,#cache#LINELEN-1  'Kill command bits in free time
:waitres                rdlong  temp, mboxcmd
                        tjnz    temp, #:waitres             'We have room for tjnz 4 or 8 cycles
                        rdlong  mboxptr, mboxdat            'Get new buffer
                        and     memp, #(cache#LINELEN-1)    'memp is index into buffer
                        add     memp, mboxptr               'memp is now HUB buf address of data to read
cache_read_ret
cache_write_ret
cache_access_ret        ret

If address 0 was not a valid address, we could eliminate the andn for CMD_MASK completely.

jazzed · 2010-11-11 09:24

David Betz wrote: »

You can also eliminate the andn if the XXX_CMD value you're ORing into the address has all of its bits set. It would probably be best to make READ_CMD=3 since I suspect reads are far more common than writes.

Excellent suggestion. There is a caveat, but it is commented in code.

Here are attachments for what I'm using in ZOG now. The debug_zog has dependencies for my 32MB hardware which include TWI Keyboard and TV_TEXT -- never replace files without first backing them up

@Bill, thanks for that suggestion. Maybe you can find other potential optimizations or general improvements.

Cheers

--Steve

Bill Henning · 2010-11-11 10:42

Will do!

jazzed wrote: »

Excellent suggestion. There is a caveat, but it is commented in code.

Here are attachments for what I'm using in ZOG now. The debug_zog has dependencies for my 32MB hardware which include TWI Keyboard and TV_TEXT -- never replace files without first backing them up

@Bill, thanks for that suggestion. Maybe you can find other potential optimizations or general improvements.

Cheers
--Steve

jazzed · 2010-11-11 13:58

Heater. wrote: »

Strangely enough the ZPU libs do not define an ioctl system call. No reason we could not add one though. ioctl is of course a general purpose hook that any device driver can use (or not) for any purpose. Its parameters and function are different for every device. How would you define it? How would we hook ioctl handlers up when Spin has no idea of function pointers?

@Heater,

Basically IOCTL function enumerations can be used for driver hookups.

Sharing the enum definitions between C and Spin is troublesome because of syntax, but is not impossible. Sharing enums at compile time is important to keep IOCTL interfaces sane (an OS doesn't usually have the compile time luxury).

The actual IOCTL implementations would be defined by the driver writer. IOCTL numbers for a device function/properties can be handled with a simple case statement.
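
As a rough illustration of the case-statement idea, here is a minimal C sketch for a text screen device. Everything in it is hypothetical: the request numbers, the struct screen fields and the screen_ioctl name are invented for the example and are not part of ZOG, the ZPU libs or TV_TEXT.

```c
#include <stddef.h>

/* Hypothetical request numbers; in practice they would be shared with the Spin side. */
enum screen_request { SCREEN_GET_COLS = 1, SCREEN_GET_ROWS = 2, SCREEN_SET_COLOR = 3 };

/* Hypothetical per-device state. */
struct screen { int cols, rows, color; };

/* Returns 0 on success, -1 for an unknown request or a missing argument. */
static int screen_ioctl(struct screen *s, int request, int *arg)
{
    if (s == NULL || arg == NULL)
        return -1;
    switch (request) {
    case SCREEN_GET_COLS:  *arg = s->cols;  return 0;
    case SCREEN_GET_ROWS:  *arg = s->rows;  return 0;
    case SCREEN_SET_COLOR: s->color = *arg; return 0;
    default:               return -1;
    }
}
```

Because the dispatch is a plain switch on an integer, the same numbering can be mirrored by a Spin case statement without needing function pointers.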

Regarding file descriptor numbers, since the file-system driver can use syscalls for opening a file, it can allocate file descriptors in cooperation with the device driver manager. I believe a file descriptor allocator would be required. A simple linear lookup allocator could be used at first, and something like a heap data structure could be used later if performance degrades with a larger system.
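
A simple linear lookup allocator of the kind described could be sketched as follows. The table size, the FD_FIRST value and the fd_alloc / fd_free names are assumptions made for illustration; descriptors 0, 1 and 2 are left alone for the console.

```c
#include <stddef.h>

enum { FD_FIRST = 3, FD_MAX = 16 };   /* 0, 1, 2 stay reserved for stdin/stdout/stderr */

static void *fd_table[FD_MAX];        /* NULL means the slot is free */

/* Linear scan for the lowest free descriptor; returns -1 when the table is full. */
static int fd_alloc(void *device)
{
    if (device == NULL)
        return -1;
    for (int fd = FD_FIRST; fd < FD_MAX; fd++) {
        if (fd_table[fd] == NULL) {
            fd_table[fd] = device;
            return fd;
        }
    }
    return -1;
}

/* Release a descriptor so a later fd_alloc can reuse it. */
static void fd_free(int fd)
{
    if (fd >= FD_FIRST && fd < FD_MAX)
        fd_table[fd] = NULL;
}
```

With only a handful of slots the scan is cheap; any hard-coded descriptor (such as a fixed 3 for a block driver) would have to be marked as taken in the table first so the allocator does not hand it out again.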

Of course we could just port uClinux and use its mechanisms ....

Cheers.
--Steve

PS: I hope you feel better soon.

David Betz · 2010-11-11 15:26

jazzed wrote: »

Here are attachments for what I'm using in ZOG now. The debug_zog has dependencies for my 32MB hardware which include TWI Keyboard and TV_TEXT -- never replace files without first backing them up

Ummm... I just tried these and they don't work with my SPI SRAM version of the cache interface. I tried just patching in your changes to the cache_read and cache_write routines in zog.spin and they seem to work fine. What other changes did you make to zog.spin and debug_zog.spin?

jazzed · 2010-11-11 16:05

David Betz wrote: »

Ummm... I just tried these and they don't work with my SPI SRAM version of the cache interface. I tried just patching in your changes to the cache_read and cache_write routines in zog.spin and they seem to work fine. What other changes did you make to zog.spin and debug_zog.spin?

@David,

Obviously the driver object and memory size in debug_zog.spin will be different. I also added some #ifdef SDCARD to debug_zog.spin where it was not possible to compile with USE_HUB_MEMORY.

Doing diffs between files always pays for me.

Probably the simplest thing to do at this point is to make all zog.spin cache calls look the same. That can eliminate uncertainty there and performance won't be much worse.

Considering how long it's been since a shared post, big divergences have surely happened. I'm not up to it right now, but this sort of thing can take a while and might be best if handled via email. If you like you can post your functional zog.spin here and I can look at that tonight.

David Betz · 2010-11-11 16:13

jazzed wrote: »

@David,

Obviously the driver object and memory size in debug_zog.spin will be different. I also added some #ifdef SDCARD to debug_zog.spin where it was not possible to compile with USE_HUB_MEMORY.

Yeah, I did that in my version of debug_zog.spin too when I tried compiling with USE_HUB_MEMORY and it failed.

Doing diffs between files always pays for me.

That's the first thing I tried. Unfortunately, the cygwin 'diff' command thought that my version of zog.spin was totally different than yours. Might have something to do with line end characters. I'll have to check further into that.

Probably the simplest thing to do at this point is to make all zog.spin cache calls look the same. That can eliminate uncertainty there and performance won't be much worse.

I thought I had done that. I guess I'd better get my 'diff' program working so I can find out what hack I put in that is interfering with your new code! :-)

Considering how long it's been since a shared post, big divergences have surely happened. I'm not up to it right now, but this sort of thing can take a while and might be best if handled via email. If you like you can post your functional zog.spin here and I can look at that tonight.

I'll try harder to get 'diff' working before I bother you with that. I haven't actually changed zog.spin much at all so it shouldn't be too hard to make that work. The changes in debug_zog.spin might take a bit longer. Right now I'm looking at adding SPI flash support to my cache code. In the process, I'd like to add the ability to have more than just a read and a write command. I'll probably add some more but you won't want to adopt my extension of your interface because it will almost certainly slow things down. At a minimum, one extra test and branch.

Thanks,
David

David Betz · 2010-11-11 16:27

jazzed wrote: »

Doing diffs between files always pays for me.

I see why 'diff' failed. It seems that my version of zog.spin is stored as UTF-16 or something like that. Every other byte in the file is a zero. I'll have to figure out how to convert this into a straight ASCII file so I can compare it with yours.

jazzed · 2010-11-11 19:50

David Betz wrote: »

I see why 'diff' failed. It seems that my version of zog.spin is stored as UTF-16 or something like that. Every other byte in the file is a zero. I'll have to figure out how to convert this into a straight ASCII file so I can compare it with yours.

Parallax font

Does cygwin have iconv ?
http://en.wikipedia.org/wiki/Iconv
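
For reference, the "every other byte is a zero" pattern is what ASCII text looks like when stored as UTF-16. A minimal C sketch of the conversion is below; it assumes little-endian UTF-16 with an optional FF FE byte order mark, and it deliberately fails on anything that is not plain ASCII (which would include any special Parallax font characters in the file). The function name is made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Convert UTF-16LE bytes to ASCII. Skips a leading BOM (FF FE) if present.
 * Returns the number of ASCII bytes written, or -1 if the input length is odd,
 * a code unit is not plain ASCII, or the output buffer is too small.
 */
static long utf16le_to_ascii(const uint8_t *in, size_t in_len, char *out, size_t out_cap)
{
    size_t i = 0, n = 0;
    if (in_len % 2 != 0)
        return -1;
    if (in_len >= 2 && in[0] == 0xFF && in[1] == 0xFE)
        i = 2;
    for (; i < in_len; i += 2) {
        if (in[i + 1] != 0 || in[i] > 0x7F || n >= out_cap)
            return -1;
        out[n++] = (char)in[i];
    }
    return (long)n;
}
```

Where iconv is available, its -f and -t options select the source and target encodings, so something along the lines of `iconv -f UTF-16 -t ASCII` on the file should do the same job.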

David Betz · 2010-11-14 18:19

I'm trying to get ZOG running out of the SPI flash chip on the C3 and in the process I've had to write a new GCC linker script to place the code in the flash. Unfortunately, I'm having trouble getting it to work. Here is the script I have written.

OUTPUT_FORMAT("elf32-zpu", "elf32-zpu", "elf32-zpu")
OUTPUT_ARCH(zpu)
SEARCH_DIR(.);
ENTRY(_start)
SEARCH_DIR("/home/dbetz/zpugcc/toolchain/build/../install/zpu-elf/lib");

RAMSIZE = 64 * 1024;
ROMSIZE = 1 * 1024 * 1024;

MEMORY {
  ram : ORIGIN = 0x00000000, LENGTH = 64K
  rom : ORIGIN = 0x00100000, LENGTH = 1M
  hub : ORIGIN = 0x10000000, LENGTH = 64K
  cog : ORIGIN = 0x10008000, LENGTH = 2K
  io  : ORIGIN = 0x10008800, LENGTH = 1K
}
  
SECTIONS {
  .text : {
    LONG(_start)
    LONG(_data_image)
    LONG(_data_start)
    LONG(_data_end)
    LONG(_bss_start)
    LONG(_bss_end)
    *(.fixed_vectors)
    *(.text)
    *(.text.*)
    *(.rodata)
    *(.rodata.*)
    . = ALIGN(4);
    _data_image = .;
  } >rom
  .data BLOCK(4) : {
    _data_start = .;
    *(.data)
    *(.data.*)
    . = ALIGN(4);
    _data_end = .;
  } >ram AT>rom
  .bss BLOCK(4) : {
    _bss_start = .;
    *(.bss)
    *(.bss.*)
    . = ALIGN(4);
    _bss_end = .;
    _end = .;
  } >ram
  .stack RAMSIZE - 0x1000 : {
    _stack = .;
    *(.stack)
  } >ram
  .hub : {
    *(.hub)
  } >hub
  .cog : {
    *(.cog)
  } >cog
  .io : {
    *(.io)
  } >io
}

The basic idea is to put all of the code and constant data in the flash starting at location 0x100000 and any data in SRAM at location 0. The only tricky part is that I put the initial values of all of the stuff in .data into the .text segment and copy it into SRAM at startup time. This seems to work okay. My problem is that I don't seem to end up linking in the _start code that is in crt0.S in the libgloss directory. Instead, I get some other _start code that doesn't work. Is there some trick to getting the correct startup code linked in?

Thanks,
David
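
For readers following along, the startup-time copy described above (initial .data values stored in flash and copied to SRAM, then .bss cleared) amounts to something like the C sketch below. The helper names copy_data and zero_bss are hypothetical; in a real crt0 they would be called with the addresses of the linker symbols _data_image, _data_start, _data_end, _bss_start and _bss_end from the script, which works because the script aligns all of them to 4 bytes.

```c
#include <stdint.h>

/* Copy the .data initial values from their load address (flash) to their run address (RAM). */
static void copy_data(const uint32_t *image, uint32_t *start, const uint32_t *end)
{
    while (start < end)
        *start++ = *image++;
}

/* Clear .bss so that uninitialised globals start out as zero. */
static void zero_bss(uint32_t *start, const uint32_t *end)
{
    while (start < end)
        *start++ = 0;
}
```

Both loops work a long at a time, so they only need the LONG access path to external memory.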

Heater. · 2010-11-14 23:47

David,

My bout of flu is only just subsiding so I've hardly been able to follow what you guys are up to. That just leaves me with the problem of catching up with work at work:(

I'm really glad you are looking at running ZPU codes from SPI FLASH. I think that is the one use case where Zog could possibly outperform Catalina and ICC.

Now I really wish I understood that linker script better.

1) In your linker script you have:

....
SECTIONS {
  .text : {
    LONG(_start)
    ....
    LONG(_bss_end)
    *(.fixed_vectors)
    *(.text)
    ....

If I'm reading that correctly it is:
a) Creating an output section ".text" that will contain...
b) ...the input sections ".fixed_vectors", ".text", and ".rodata".
c) The resulting ".text" section to be placed at "rom" address.
d) Which is defined as 0x00100000 in the MEMORY clause.

Am I right so far?

So a few thoughts:

1) If you look at a listing of the finished build, or the hex dump, it will not look like the original assembler source code. Why?

The assembler source at _start looks like this:

_start:
_memreg:
                ; intSp must be 0 when we jump to _premain
                im ZPU_ID
                loadsp 0

But that "im ZPU_ID" actually assembles to more than one instruction, more like:

                im 0
                im 0
                im 0
                im 0
                im ZPU_ID

This is because ZPU_ID is a small constant but the assembler always generates enough "im"s to load a 32 bit constant.

Normally the linker, when used with the "relax" option will replace all the long "im" sequences with just enough to load the required constant, in this case just one. Thus optimizing executable size.
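
To make the size difference concrete, here is a small C sketch of how many "im" instructions a constant needs. It assumes the usual ZPU encoding, where each IM opcode is 0x80 plus 7 bits of immediate, the first IM in a run is sign-extended and each following IM shifts in 7 more bits. The function names are invented for the example and this is not code from the assembler or linker.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest number of IM instructions (1..5) whose 7-bit groups hold the value
   once the first group is sign-extended. */
static int im_count(int32_t value)
{
    for (int n = 1; n < 5; n++) {
        int32_t limit = (int32_t)1 << (7 * n - 1);
        if (value >= -limit && value < limit)
            return n;
    }
    return 5;
}

/* Emit exactly `count` IM opcodes (0x80 | 7 bits), most significant group first. */
static void im_encode(int32_t value, int count, uint8_t *out)
{
    assert(count >= im_count(value) && count <= 5);
    for (int i = 0; i < count; i++) {
        int shift = 7 * (count - 1 - i);
        out[i] = (uint8_t)(0x80u | (((uint32_t)value >> shift) & 0x7Fu));
    }
}
```

Under these assumptions, a small constant such as 2 needs one byte (0x82) in relaxed form but five bytes (0x80, 0x80, 0x80, 0x80, 0x82) when the assembler pads it to a full 32-bit load.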

But ...the ".fixed_vectors" section is arranged to be non-executable so that the relaxation is not done. See this note in crt0.S:

; KLUDGE!!! we remove the executable bit to avoid relaxation 
        .section ".fixed_vectors","a"

As a result the final output of the build results in something like:

Or the op code sequence 0b, 0b, 0b, 0b, 82.

2) PROBLEM: At _start is also defined _memreg.

memreg locates pseudo registers used by GCC in some cases, like returning values from functions. I can't remember exactly, but this is the base address of a two or four register array area. So the problem is that memreg must be in RAM. This new linker script puts memreg in read only FLASH and so the code will not run correctly.

memreg is defined at address zero so as to minimize the number of IMs the compiler must generate to use it but it could be defined anywhere in RAM with a little performance hit.

BUT...the SYSCALL in Zog is hard coded to use memreg at address zero at the moment.

3) Luckily we don't use any of those EMULATE vectors in _start so moving them to FLASH, or removing them from crt0.S, should not hurt. However there is an interrupt vector defined there which we might want to take into use one day and should not be relocated to FLASH.

Hope this gives you some clues.

Heater. · 2010-11-14 23:54

David,

Is there some trick to getting the correct startup code linked in?

I'd be inclined to go through the zpugcc directories and delete all the crt0 object files I can find. Then rebuild the crt0 and link again. Just to be sure there is no old junk getting in.

David Betz · 2010-11-15 03:54

Heater: Thanks for all of your suggestions about building ZOG. I will certainly move the registers back to SRAM. I still have that located at zero so that shouldn't be too hard. Do you know which linker script gets used by default to build ZOG programs?

David Betz · 2010-11-15 08:47

Heater. wrote: »

David,

My bout of flu is only just subsiding so I've hardly been able to follow what you guys are up to. That just leaves me with the problem of catching up with work at work :(

I'm really glad you are looking at running ZPU codes from SPI FLASH. I think that is the one use case where Zog could possibly outperform Catalina and ICC.

Couldn't they also execute code out of flash? I've been thinking of trying to interface my cache code with Catalina but it certainly isn't as straightforward to do that as to interface with ZOG.

Now I really wish I understood that linker script better.

1) In your linker script you have:
....
SECTIONS {
  .text : {
    LONG(_start)
    ....
    LONG(_bss_end)
    *(.fixed_vectors)
    *(.text)
    ....
If I'm reading that correctly it is:
a) Creating an output section ".text" that will contain...
b) ...the input sections ".fixed_vectors", ".text", and ".rodata".
c) The resulting ".text" section to be placed at "rom" address.
d) Which is defined as 0x00100000 in the MEMORY clause.

Am I right so far?

Yes, that is correct.

So a few thoughts:

1) If you look at a listing of the finished build, or the hex dump, it will not look like the original assembler source code. Why?

The assembler source at _start looks like this:
_start:
_memreg:
                ; intSp must be 0 when we jump to _premain
                im ZPU_ID
                loadsp 0
But that "im ZPU_ID" actually assembles to more than one instruction, more like:
                im 0
                im 0
                im 0
                im 0
                im ZPU_ID
This is because ZPU_ID is a small constant but the assembler always generates enough "im"s to load a 32 bit constant.

Normally the linker, when used with the "relax" option, will replace each long "im" sequence with just enough instructions to load the required constant, in this case just one, thus optimizing executable size.

But... the ".fixed_vectors" section is arranged to be non-executable so that the relaxation is not done. See this note in crt0.S:
; KLUDGE!!! we remove the executable bit to avoid relaxation 
        .section ".fixed_vectors","a"
As a result the final output of the build results in something like:
                nop
                nop
                nop
                nop
                im 2
Or the op code sequence 0b, 0b, 0b, 0b, 82.
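
The opcode bytes quoted above can be reproduced with a few lines of Python. This is only a sketch, assuming the usual ZPU encoding in which an "im" opcode has bit 7 set and carries 7 bits of the constant, with the first "im" of a run sign-extended. The names im_opcodes and padded_to_five are made up for illustration and are not part of zpugcc; padded_to_five simply mirrors the listing shown above and makes no claim about how the toolchain produces it.

```python
NOP = 0x0B  # ZPU nop opcode, as in the opcode sequence quoted above


def im_opcodes(value):
    """Return the shortest run of 'im' opcodes that loads a signed 32-bit value."""
    if not -(1 << 31) <= value < (1 << 31):
        raise ValueError("value does not fit in 32 bits")
    # Each 'im' carries 7 bits and the first one of a run is sign-extended,
    # so n opcodes cover a signed range of 7*n bits.
    count = 1
    while not -(1 << (7 * count - 1)) <= value < (1 << (7 * count - 1)):
        count += 1
    # Emit the 7-bit groups most significant first, each with bit 7 set.
    return [0x80 | ((value >> (7 * i)) & 0x7F) for i in reversed(range(count))]


def padded_to_five(value):
    """Hypothetical helper: fill a five-opcode slot with leading nops."""
    ops = im_opcodes(value)
    return [NOP] * (5 - len(ops)) + ops


print([f"{op:02x}" for op in im_opcodes(2)])
print([f"{op:02x}" for op in padded_to_five(2)])
```

Five opcodes is the worst case because 5 x 7 = 35 bits is the first multiple of 7 that covers a full 32 bit constant, which is why the unrelaxed slot is five bytes long.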

That explains a lot. I was wondering why the disassembled code didn't match the code in crt0.S.

2) PROBLEM: At _start is also defined _memreg.

memreg locates the pseudo registers used by GCC in some cases, like returning values from functions. I can't remember exactly, but this is the base address of a two or four register array area. So the problem is that memreg must be in RAM. This new linker script puts memreg in read only FLASH and so the code will not run correctly.

memreg is defined at address zero so as to minimize the number of IMs the compiler must generate to use it but it could be defined anywhere in RAM with a little performance hit.

BUT... the SYSCALL in Zog is hard coded to use memreg at address zero at the moment.

Do _start and memreg need to share space or can I move memreg to the start of SRAM which is still at address zero?

3) Luckily we don't use any of those EMULATE vectors in _start so moving them to FLASH, or removing them from crt0.S, should not hurt. However there is an interrupt vector defined there which we might want to take into use one day and should not be relocated to FLASH.

Hope this gives you some clues.

So we don't emulate any instructions in ZOG? They are all implemented by the ZOG COG code?

Heater. · 2010-11-16 03:32

David,

Couldn't they also execute code out of flash?

Yes, I'm sure they could. But then they lose the speed advantage of using PASM instructions what with having to fetch 4 bytes for every instruction rather than 1.

Do _start and memreg need to share space or can I move memreg to the start of SRAM which is still at address zero?

No. I see no reason why memreg can't be moved wherever you like.

So we don't emulate any instructions in ZOG?

Correct, all ZPU ops that could be done via EMULATE are done in PASM by Zog. So we can remove almost 1K of emulation code and vectors from crt0.S.

Do you know which linker script gets used by default to build ZOG programs?

You won't find a linker script. zpugcc uses some kind of internal setup.

You can see what that is by passing the verbose option to the linker step.

LDFLAGS=$(ZPUPLATFORM) -Wl,-verbose -Wl,--relax -Wl,--gc-sections

David Betz · 2010-11-16 04:53

If I try to run the version of fibo that I built with my new linker script I get the following error:

ZOG v1.6 (CACHE)
Starting SD driver...0000FFFF
Mounting SD...00000000

pc       op sp       tos      reason
00100030 00 0000FFF4 00000014 BREAKPOINT

The disassembled code looks like this:

00100020 <_start>:
  100020:       0b              nop
  100021:       0b              nop
  100022:       0b              nop
  100023:       0b              nop
  100024:       82              im 2
  100025:       70              loadsp 0
  100026:       0b              nop
  100027:       0b              nop
  100028:       0b              nop
  100029:       0b              nop
  10002a:       94              im 20
  10002b:       0c              store
  10002c:       3a              config
  10002d:       0b              nop
  10002e:       80              im 0
  10002f:       c0              im -64
  100030:       f0              im -16
  100031:       a1              im 33
  100032:       04              poppc

I'm not sure why it would think there is a breakpoint at 100030. I'm wondering if I might have endian issues again. Any ideas?
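
One way to sanity check a dump like this by hand is to decode the runs of "im" opcodes and see which constants they build. The sketch below assumes the usual ZPU rule: the first "im" of a run sign-extends its 7 bit operand, and each following "im" shifts the value left by 7 and adds 7 more bits. decode_im_runs is a made-up helper name, not part of ZOG or zpugcc.

```python
def decode_im_runs(code):
    """Yield (offset, value) for each run of consecutive 'im' opcodes.

    Values are reported as unsigned 32-bit numbers.
    """
    value = None  # None means the previous opcode was not an 'im'
    start = 0
    for offset, op in enumerate(code):
        if op & 0x80:
            bits = op & 0x7F
            if value is None:
                # First 'im' of a run: sign-extend the 7-bit operand.
                start = offset
                value = (bits | 0xFFFFFF80) if (bits & 0x40) else bits
            else:
                # Following 'im': shift in 7 more bits.
                value = ((value << 7) | bits) & 0xFFFFFFFF
        elif value is not None:
            yield start, value
            value = None
    if value is not None:
        yield start, value


# Bytes 0x100020..0x100032 copied from the disassembly above.
dump = bytes.fromhex("0b0b0b0b" "82700b0b" "0b0b940c" "3a0b80c0" "f0a104")
for offset, value in decode_im_runs(dump):
    print(f"{0x100020 + offset:06x}: {value:#010x}")
```

Read in the order the disassembler shows them, the four "im" opcodes at 10002e..100031 build the constant 0x00103821, which is what the poppc at 100032 would jump to if the bytes were fetched in that order.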

Heater. · 2010-11-16 05:48

You could try compiling zog with SINGLE_STEP defined and see if it executes correctly to that point.

The value in tos at the breakpoint does not look right.

David Betz · 2010-11-16 06:01

Heater. wrote: »

You could try compiling zog with SINGLE_STEP defined and see if it executes correctly to that point.

The value in tos at the breakpoint does not look right.

Actually, the stack pointer itself does not look right either. There is only 64k of SRAM and the stack pointer points almost at the top of that space. I think the ZPU stack grows up not down. I'm going to rebuild my crt0.S to move the registers to SRAM to see if that helps.

David Betz · 2010-11-16 06:28

Here is what I get when I single step ZOG.

pc       op sp       tos      reason
00100020 0B 0000FFF8 80FF0204
00100021 0B 0000FFF8 80FF0204
00100022 0B 0000FFF8 80FF0204
00100023 0B 0000FFF8 80FF0204
00100024 0B 0000FFF8 80FF0204
00100025 0B 0000FFF8 80FF0204
00100026 70 0000FFF8 80FF0204
00100027 82 0000FFF4 80FF0204
00100028 0C 0000FFF0 00000002
00100029 A4 0000FFF8 80FF0204
0010002A 0B 0000FFF4 00000024
0010002B 0B 0000FFF4 00000024
0010002C C0 0000FFF4 00000024
0010002D 80 0000FFF0 FFFFFFC0
0010002E 0B 0000FFF0 FFFFE000
0010002F 3A 0000FFF0 FFFFE000
00100030 00 0000FFF4 00000024 BREAKPOINT

Heater. · 2010-11-16 06:31

No. The ZPU stack grows downward.

Normally one would initialize it to an address 4 bytes greater than the highest RAM address.

I set it a bit lower so that my single step traces matched those of the VHDL ZPU running under the GHDL simulator and I could "diff" the results easily.

David Betz · 2010-11-16 06:40

I just compared my single step trace with the dump I posted a little earlier and it looks like I am having endian problems. The opcodes that are executed in the single step are the same as those in the dump but in a different order. I guess I'll have to figure out what I'm doing wrong. I byteswapped my binary file before writing it to flash. Maybe that isn't necessary.
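
The comparison can be reproduced offline. A minimal sketch, assuming the difference is a byte reversal inside each 32 bit word; swap32 is a hypothetical helper and not part of any ZOG tool. The two words used here are the ones at 100024 and 10002c, taken from the disassembly and from the single step trace.

```python
def swap32(data):
    """Reverse the byte order inside each 32-bit word of data."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4")
    return b"".join(data[i:i + 4][::-1] for i in range(0, len(data), 4))


# Words at 0x100024 and 0x10002c as shown in the disassembly.
dump_words = bytes.fromhex("82700b0b" "3a0b80c0")
# Opcodes the single step trace fetched from the same two words.
trace_words = bytes.fromhex("0b0b7082" "c0800b3a")

print(swap32(dump_words) == trace_words)
```

If the check holds, ZOG is fetching each word in the opposite byte order from the one the disassembler used. That is consistent with one byte swap too many or too few somewhere between the build and the flash, but it does not by itself say which step is at fault.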
