Zog - A ZPU processor core for the Prop + GNU C, C++ and FORTRAN.Now replaces S

lonesock · 2010-02-05 20:08

Cluso99 said...
ZOG does not have a ring to it. BTW neither does ZPUCog. And I agree it cannot be ZyCog either. :-(
How about Z_Cog, SoftCog (since it is sort of a soft cog), ZPCog. Nothing sounds as good as ZiCog & MoCog.

CogZ or Cog-Z?

Jonathan

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.

Bill Henning · 2010-02-05 20:11

I like CogZ... has a nice ring to it.

lonesock said...

Cluso99 said...
ZOG does not have a ring to it. BTW neither does ZPUCog. And I agree it cannot be ZyCog either. :-(
How about Z_Cog, SoftCog (since it is sort of a soft cog), ZPCog. Nothing sounds as good as ZiCog & MoCog.

CogZ or Cog-Z?

Jonathan

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
Las - Large model assembler Largos - upcoming nano operating system

jazzed · 2010-02-05 20:23

lonesock said...
CogZ or Cog-Z?
Jonathan

Much better than ZPU ... ZeePoooh! yuck! :)

heater · 2010-02-05 21:10

Bill: A Spin version of ZPU is definitely in the works but first I'll get the C version running so that we know we understand the ops correctly. Having a debugger would be excellent.

Lonesock & Bill : CogZ it is then. I'll update the first post and thread title.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

RossH · 2010-02-05 23:21

@heater,

Re: CogZ vs Catalina & Imagecraft ...

For LMM programs CogZ won't even be in the race, since a LONG fetch from hub is the same as a BYTE fetch from hub - and once fetched, these compilers execute instructions at PASM speed. But CogZ may be competitive for XMM programs, where the benefit of fetching a byte vs a long can be significant. It will depend a bit on the hardware.

However, I think you may have found a good solution for "big SPIN".

I'll watch this thread with interest.

Ross.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina

heater · 2010-02-06 05:59

RossH. I would not dream of challenging Catalina or ImageCraft to a drag race "along the straight" in LMM.

In that mode its only benefit might be its smaller code size, as all ops are only a byte wide. So let's see, could someone compile that FIBO I posted with Catalina and ImageCraft so we can see how the code size compares?

The ZPU will fare better in a speed comparison with XMM. But I fear even there it will be trailing in last place, because it is a stack based machine with no registers, so there is an awful lot of memory access going on.

Here is an example of a simple C statement and the ZPU asm GCC generates. Annotations by me:

int a, b, c;
int main(int argc, char* argv[])  {
        a = b + c;
        return (a);
}

Generates:

        im b        ; Push the address of "b" to stack
        load        ; Push b to stack top using address popped from stack
        im c        ; Push the address of "c" to stack
        load        ; Push c to stack using address popped
        add         ; Pop and add the top two stack items and push result to stack
        im a        ; Push the address of "a" to stack
        store       ; Pop the address of "a", then pop the sum, and store the sum at that address

Now: An "IM b" can be done in a one byte opcode read + 4 byte writes for the PUSH to stack. Total 5 memory accesses.
A LOAD or STORE is a one byte read for the op + two PUSH/POP operations at 4 bytes each + a memory read (LOAD) or write (STORE) of 4 bytes. Total 13 accesses.
The ADD is one byte for the op + two POPs and a PUSH at 4 bytes each, for a total of 13 accesses.

I make that a grand total of 67 byte accesses to external RAM to add two 32 bit numbers !!!
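The tally above can be checked with a toy model. This is only a sketch under assumptions of my own: the memory size, the addresses chosen for a, b and c, big-endian words, a stack that grows downwards, and the names `Zpu` and `run_add_example` are all made up for illustration. The opcode values (0x05 add, 0x08 load, 0x0c store, top bit set for IM) are the ones that show up in the objdump listings later in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the seven-instruction sequence for a = b + c.
   Memory size, variable addresses, big-endian words and a downward
   growing stack are assumptions made for illustration only. */
enum { MEM_SIZE = 128, ADDR_A = 16, ADDR_B = 20, ADDR_C = 24 };
enum { OP_ADD = 0x05, OP_LOAD = 0x08, OP_STORE = 0x0c, OP_IM = 0x80 };

typedef struct {
    uint8_t  mem[MEM_SIZE];
    uint32_t pc, sp;
    unsigned total_accesses;   /* every byte read or written */
    unsigned stack_accesses;   /* the subset caused by PUSH and POP */
} Zpu;

static uint8_t rd8(Zpu *z, uint32_t addr) {
    assert(addr < MEM_SIZE);
    z->total_accesses++;
    return z->mem[addr];
}

static void wr8(Zpu *z, uint32_t addr, uint8_t v) {
    assert(addr < MEM_SIZE);
    z->total_accesses++;
    z->mem[addr] = v;
}

static uint32_t rd32(Zpu *z, uint32_t addr) {
    uint32_t v = 0;
    for (int i = 0; i < 4; i++) v = (v << 8) | rd8(z, addr + i);
    return v;
}

static void wr32(Zpu *z, uint32_t addr, uint32_t v) {
    for (int i = 0; i < 4; i++) wr8(z, addr + i, (uint8_t)(v >> (24 - 8 * i)));
}

static void push(Zpu *z, uint32_t v) {
    z->sp -= 4;
    wr32(z, z->sp, v);
    z->stack_accesses += 4;
}

static uint32_t pop(Zpu *z) {
    uint32_t v = rd32(z, z->sp);
    z->sp += 4;
    z->stack_accesses += 4;
    return v;
}

/* Execute one instruction. Only the four ops used here are modelled,
   and chained IMs are deliberately not handled. */
static void step(Zpu *z) {
    uint8_t op = rd8(z, z->pc++);
    if (op & OP_IM) {
        int32_t imm = ((op & 0x7f) ^ 0x40) - 0x40;   /* sign-extend 7 bits */
        push(z, (uint32_t)imm);
    } else if (op == OP_LOAD) {
        uint32_t addr = pop(z);
        push(z, rd32(z, addr));
    } else if (op == OP_STORE) {
        uint32_t addr = pop(z);
        uint32_t value = pop(z);
        wr32(z, addr, value);
    } else if (op == OP_ADD) {
        uint32_t x = pop(z);
        uint32_t y = pop(z);
        push(z, x + y);
    }
}

/* Runs a = b + c with b = 5 and c = 7 and returns the final state. */
static Zpu run_add_example(void) {
    static const uint8_t code[] = {
        OP_IM | ADDR_B, OP_LOAD, OP_IM | ADDR_C, OP_LOAD,
        OP_ADD, OP_IM | ADDR_A, OP_STORE
    };
    Zpu z = {0};
    z.sp = MEM_SIZE;
    for (unsigned i = 0; i < sizeof code; i++) z.mem[i] = code[i];
    z.mem[ADDR_B + 3] = 5;   /* big-endian: low byte is stored last */
    z.mem[ADDR_C + 3] = 7;
    for (unsigned i = 0; i < sizeof code; i++) step(&z);
    return z;
}
```

Calling `run_add_example()` and reading `total_accesses` gives 67 under this model, with `stack_accesses` accounting for 48 of them. The remaining 19 are the seven opcode fetches plus the three 4-byte variable accesses.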

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

Mike Green · 2010-02-06 06:24

Is there a particular reason why the stack has to be in XMM as opposed to Hub RAM?

If there needs to be a single uniform address space, how about placing Hub RAM at one end of the address space and pointing the stack pointer there.

Post Edited (Mike Green) : 2/6/2010 6:29:19 AM GMT

heater · 2010-02-06 06:39

A very good question Mike and one I have already pondered.

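I suspect it would be quite possible to keep the stack in HUB. By the counts above, 48 of those 67 accesses are stack pushes and pops (12 for the three IMs, 24 for the LOADs and the STORE, 12 for the ADD), so that would cut the external memory thrashing by more than two thirds!!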

In the general case of course it is quite possible for ZPU to do LOAD and STORE to somewhere within the stack in which case a stack in HUB would fail unless we did some checking of the LOAD/STORE addresses every time and sorted things out accordingly.

I suspect that normal everyday GCC generated code will never do that and we could get away with it. In which case we start closing in on Catalina and ImageCraft performance for XMM.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

RossH · 2010-02-06 06:40

heater,

Mike is right - your stack should always be in hub RAM. Catalina's LARGE memory mode has a flat address space, but the first $8000 addresses are the normal hub RAM addresses and the stack is always allocated in this address range.

Time/space tradeoffs always fascinate me. Your addresses will presumably be 32 bits, so the code will not be 1/4 the size of Catalina's - but it could be 1/2 the size. Whether it is worth it depends on how much speed you lose. I'll be interested to see the results.

Ross.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina

heater · 2010-02-06 06:44

Mike, I posted as you edited your post. Yes, I think that's what I meant by "checking" and "sorting" the LOAD/STORE access addresses. That checking takes a bit of PASM but the gains are still worth it.

I reckon the first ZPU implementation will be in Spin with no XMM, so the stack will already be in the right place, in HUB. Then we can experiment with moving the code and then data to XMM.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

Post Edited (heater) : 2/6/2010 9:46:21 AM GMT

jazzed · 2010-02-06 06:49

BTW: AFAIK there is no ImageCraft XMM solution today. I got something working but lost interest after fighting the LMM kernel quirks for so long.

Cluso99 · 2010-02-06 07:00

heater: since the ZPU is only tiny, I wonder if there would be space in the cog for the stack? I know it is a tall ask, but it could speed things significantly if it could fit.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Links to other interesting threads:

· Home of the MultiBladeProps: TriBlade, RamBlade, SixBlade, website
· Single Board Computer: 3 Propeller ICs and a TriBladeProp board (ZiCog Z80 Emulator)
· Prop Tools under Development or Completed (Index)
· Emulators: CPUs Z80 etc; Micros Altair etc; Terminals VT100 etc; (Index) ZiCog (Z80), MoCog (6809)
· Prop OS: SphinxOS, PropDos, PropCmd · Search the Propeller forums (uses advanced Google search)
My cruising website is: www.bluemagic.biz · MultiBlade Props: www.cluso.bluemagic.biz

heater · 2010-02-06 07:13

Blimey, the first version of CogZ does not exist yet and we already have a dozen optimizations on the table :)

The Stack in Cog may also be possible. Whilst the base ZPU is tiny it's probably worth more to fill up the COG with PASM implementations of the ZPU ops that would otherwise be emulated in ZPU ASM.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

RossH · 2010-02-06 07:20

Heater, Cluso ...

Having the call stack in cog appealed to me when I was writing Catalina - but it REALLY complicates things. For instance, you can probably only fit the call stack, not the stack frame (which contains all your local variables and function arguments) - otherwise it is quite possible to use all your cog stack space on one function call. But this means you actually have TWO stacks - a call stack in cog ram, and a frame stack in hub ram. The added complexity just didn't seem worth the potential savings to me.

Ross.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina

heater · 2010-02-06 09:43

I really don't want to get into the complication of separate call and data stacks. I'd rather make the restriction that if you want fast code then you don't put lots of local data in your functions.

So what we need is a flat memory space presented to ZPU via an interface that maps some of it to HUB and some of it to external RAM. So if you use a lot of stack you will find the code slowing down dramatically.
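A rough sketch of what such an address mapping could look like, in C. The layout (hub RAM in the first $8000 addresses, external RAM directly after it) follows what RossH described for Catalina's LARGE mode and is only an assumption for CogZ; the names `map_address` and `is_hub_access` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { HUB_SIZE = 0x8000 };   /* 32 KB of hub RAM, assumed to sit at address 0 */

typedef enum { REGION_HUB, REGION_EXTERNAL } Region;

typedef struct {
    Region   region;   /* which memory the address lands in */
    uint32_t offset;   /* offset inside that memory */
} Mapped;

/* Translate a flat ZPU address into (region, offset). */
static Mapped map_address(uint32_t addr) {
    Mapped m;
    if (addr < HUB_SIZE) {
        m.region = REGION_HUB;
        m.offset = addr;
    } else {
        m.region = REGION_EXTERNAL;
        m.offset = addr - HUB_SIZE;
    }
    return m;
}

/* True when every byte of a 1, 2 or 4 byte access lies in hub RAM.
   An access that starts in hub and ends in external RAM returns 0
   and would need the slow path. */
static int is_hub_access(uint32_t addr, unsigned size) {
    assert(size == 1 || size == 2 || size == 4);
    return addr < HUB_SIZE && HUB_SIZE - addr >= size;
}
```

With this layout a stack placed below $8000 is served from hub, and anything deeper than that falls through to external RAM, which is where the dramatic slowdown would come from.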

Problem: I've just been looking at the Java ZPU simulator and it implements a stack that grows upwards! This seems to complicate that idea somewhat.

RossH: "Your addresses will presumably be 32 bits, so the code will not be 1/4 the size of Catalina's"

ZPU has an interesting feature. In the code snippet I posted above you will see things like "im b" where it is loading the address of "b" from immediate data. Thing is that "im" is a one byte instruction, the top bit "1" indicates "im" and the other seven bits are the immediate data. So small addresses can be loaded with a single one byte "im". For larger addresses just chain "im"s together and it will load another 7 bits and another 7 bits etc. In this way for small data sets the code can be quite small.
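To make the chaining concrete, here is a small decoder sketch in C. Two things in it are assumptions rather than anything stated above: that the first IM of a chain sign-extends its 7 bits (this matches objdump output such as 0xfe shown as "im -2"), and that each following IM shifts the value left by 7 and ORs in its own 7 bits. The function name `decode_im_chain` is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a run of n consecutive IM bytes into one 32-bit immediate.
   Assumes the first IM sign-extends its 7-bit payload and each later
   IM shifts the value left by 7 bits and ORs in its own payload. */
static int32_t decode_im_chain(const uint8_t *ops, size_t n) {
    assert(n >= 1 && n <= 5);
    uint32_t value = 0;
    for (size_t i = 0; i < n; i++) {
        assert(ops[i] & 0x80);              /* top bit set marks an IM */
        uint32_t bits = ops[i] & 0x7fu;
        if (i == 0) {
            value = (bits ^ 0x40u) - 0x40u; /* sign-extend 7 bits to 32 */
        } else {
            value = (value << 7) | bits;
        }
    }
    /* Relies on the usual two's complement conversion to signed. */
    return (int32_t)value;
}
```

For example, the single byte 0x8c decodes to 12, and the two bytes 0x8a 0xe4 decode to 1380 (10 * 128 + 100), so an address up to 8191 costs two bytes of code rather than four.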

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

Humanoido · 2010-02-06 10:13

heater - this is a really great project! Keep up the outstanding work!

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
humanoido
*Stamp SEED Supercomputer *Basic Stamp Supercomputer *TriCore Stamp Supercomputer
*Minuscule Stamp Supercomputer *Three Dimensional Computer *Penguin with 12 Brains
*Penguin Tech *StampOne News! *Penguin Robot Society
*Handbook of BASIC Stamp Supercomputing
*Ultimate List Propeller Languages
*MC Prop Computer

Bill Henning · 2010-02-06 16:50

Interesting...

Can you compile the following:

int go_deep(int n) {
    int b;
    b += go_deep(n-1);
    return &b;
}

It is a nonsense function, but it should stop the optimizer from mucking too much with the code.

I am interested in how it handles references to addresses of local variables in C functions. Reason: It might be difficult to put the stack into the hub, depending on how it handles it.

heater · 2010-02-06 18:25

Bill, that turned out to be a deep question :)

This is long so please persevere.

If you compile a single module with -S to just get an assembler listing you get this:

go_deep:
    im _memreg+12
    load
    pushsp
    im _memreg+12
    store
    im -2
    pushspadd
    popsp
    im _memreg+12
    load
    im 8
    add
    load
    im -1
    add
    loadsp 0
    storesp 8
    storesp 8
    impcrel (go_deep)
    callpcrel
    im _memreg+0
    load
    im _memreg+12
    load
    im -4
    add
    load
    addsp 4
    im _memreg+12
    load
    im -4
    add
    store
    im _memreg+12
    load
    im -4
    add
    loadsp 0
    im _memreg+0
    store
    storesp 4
    storesp 8
    im 4
    pushspadd
    popsp
    im _memreg+12
    store
    poppc
    .size    go_deep, .-go_deep

Problem is it contains things like "_memreg", "impcrel" and "callpcrel" which are not ZPU ops and don't mean anything to me yet. Furthermore, there is not a one to one correspondence between the asm instructions there and the actual op bytes in the finished program.

For example disassembling the compiled object with objdump gives this:


Output from objdump of a single compiled module:

00000000 <go_deep>:
   0:    00              breakpoint
   1:    00              breakpoint
   2:    00              breakpoint
   3:    00              breakpoint
   4:    00              breakpoint
   5:    08              load
   6:    02              pushsp
   7:    00              breakpoint
   8:    00              breakpoint
   9:    00              breakpoint
   a:    00              breakpoint
   b:    00              breakpoint
   c:    0c              store
   d:    fe              im -2
   e:    3d              pushspadd
   f:    0d              popsp
  10:    00              breakpoint
  11:    00              breakpoint
  12:    00              breakpoint
  13:    00              breakpoint
  14:    00              breakpoint
  15:    08              load
  16:    88              im 8
  17:    05              add
  18:    08              load
  19:    ff              im -1
  1a:    05              add
  1b:    70              loadsp 0
  1c:    52              storesp 8
  1d:    52              storesp 8
  1e:    00              breakpoint
  1f:    00              breakpoint
  20:    00              breakpoint
  21:    00              breakpoint
  22:    00              breakpoint
  23:    3f              callpcrel
  24:    00              breakpoint
  25:    00              breakpoint
  26:    00              breakpoint
  27:    00              breakpoint
  28:    00              breakpoint
  29:    08              load
  2a:    00              breakpoint
  2b:    00              breakpoint
  2c:    00              breakpoint
  2d:    00              breakpoint
  2e:    00              breakpoint
  2f:    08              load
  30:    fc              im -4
  31:    05              add
  32:    08              load
  33:    11              addsp 4
  34:    00              breakpoint
  35:    00              breakpoint
  36:    00              breakpoint
  37:    00              breakpoint
  38:    00              breakpoint
  39:    08              load
  3a:    fc              im -4
  3b:    05              add
  3c:    0c              store
  3d:    00              breakpoint
  3e:    00              breakpoint
  3f:    00              breakpoint
  40:    00              breakpoint
  41:    00              breakpoint
  42:    08              load
  43:    fc              im -4
  44:    05              add
  45:    70              loadsp 0
  46:    00              breakpoint
  47:    00              breakpoint
  48:    00              breakpoint
  49:    00              breakpoint
  4a:    00              breakpoint
  4b:    0c              store
  4c:    51              storesp 4
  4d:    52              storesp 8
  4e:    84              im 4
  4f:    3d              pushspadd
  50:    0d              popsp
  51:    00              breakpoint
  52:    00              breakpoint
  53:    00              breakpoint
  54:    00              breakpoint
  55:    00              breakpoint
  56:    0c              store
  57:    04              poppc

Whooa! Why is that so big? And what are all those breakpoints doing in there?

Turns out those breakpoints are holding places where the linker is going to put IM ops when it links the entire program.
And the linker can get rid of many of them.

So we move on and link the compiled program. The linker fills in the breakpoint slots with IM ops and if we use the -relax option it removes any redundant ones.


Output from objdump on linked program no optimization:

00000564 <go_deep>:
 564:    8c              im 12
 565:    08              load
 566:    02              pushsp
 567:    8c              im 12
 568:    0c              store
 569:    fe              im -2
 56a:    3d              pushspadd
 56b:    0d              popsp
 56c:    8c              im 12
 56d:    08              load
 56e:    88              im 8
 56f:    05              add
 570:    08              load
 571:    ff              im -1
 572:    05              add
 573:    70              loadsp 0
 574:    52              storesp 8
 575:    52              storesp 8
 576:    ed              im -19
 577:    3f              callpcrel
 578:    80              im 0
 579:    08              load
 57a:    8c              im 12
 57b:    08              load
 57c:    fc              im -4
 57d:    05              add
 57e:    08              load
 57f:    11              addsp 4
 580:    8c              im 12
 581:    08              load
 582:    fc              im -4
 583:    05              add
 584:    0c              store
 585:    8c              im 12
 586:    08              load
 587:    fc              im -4
 588:    05              add
 589:    70              loadsp 0
 58a:    80              im 0
 58b:    0c              store
 58c:    51              storesp 4
 58d:    52              storesp 8
 58e:    84              im 4
 58f:    3d              pushspadd
 590:    0d              popsp
 591:    8c              im 12
 592:    0c              store
 593:    04              poppc

Hooraay! That looks more like what we want. Moving on again, we optimize for space with -Os.

Output from objdump on linked program optimized for size:

00000564 <go_deep>:
 564:    ff              im -1
 565:    3d              pushspadd
 566:    0d              popsp
 567:    ff              im -1
 568:    14              addsp 16
 569:    51              storesp 4
 56a:    f9              im -7
 56b:    3f              callpcrel
 56c:    82              im 2
 56d:    3d              pushspadd
 56e:    80              im 0
 56f:    0c              store
 570:    83              im 3
 571:    3d              pushspadd
 572:    0d              popsp
 573:    04              poppc

Job done.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

heater · 2010-02-06 18:28

By the way, if you compile and link the program normally (no -relax option for the linker), the linker replaces those breakpoints with as many IM ops as it needs to get the immediate data in, then replaces any spare breakpoints with NOPs. So the code stays huge.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

Bill Henning · 2010-02-06 18:58

That is a pretty good optimizer!

pushspadd pretty much has to be implemented in pasm, it will help a LOT.

I can see that long reads/writes are going to be very common, so I will implement separate messages for BYTE/WORD/LONG, which will remove a bunch of hub accesses to 'vmbytes' - however I will do this after the first version works.

The Spin API won't have to change.

I think your Spin implementation of CogZ will be a great test for my Spin API - and VMCOG :)

For CogZ I think a delayed write page replacement policy will be far better than write-through.

heater · 2010-02-06 19:28

So far...

What I've done is to create a C version of the ZPU simulator by cutting all the emulator methods out of the ZyLin ZPU simulator, which is in Java, and throwing away any extraneous junk. This way we get a good handle on the required implementation without having to worry about correctly interpreting the documentation. For example, the docs don't talk about which way the stack grows or what the endianness of the words and longs is, etc.

Next step will be to get some ZPU code running under it. Just now it has executed its first instructions, NOP and BREAKPOINT.

The current challenge is figuring out how to get the GCC to compile stuff nicely and then how to extract a runnable binary image from the resulting executable. None of that process is documented very well either.

If I can get that all working then it's time to simplify the C simulator and make it look more like Spin so I can get on with a Spin version. Then the PASM version.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

Bill Henning · 2010-02-06 20:00

Sounds like a plan!

I've gone ahead and re-arranged VMCOG for separate READVMB/W/L and WRITEVMB/W/L messages, updated the Spin API wrappers etc.

My plan for testing is:

- READB/WRITEB

Then, once those work perfectly:

- Aligned READW/L and WRITEW/L, with unaligned accesses causing a BUSERR (so we can find software that causes unaligned access); I may add a STATUS long to the mailbox

Once those work perfectly:

- Allow unaligned word/long access, and throw in a bit of optimization so that if the word/long is aligned, we only check for presence in the working set once.
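The alignment test and the "check once" optimization can be sketched in a few lines of C. None of this is the actual VMCOG code or its API: the status codes, the function names and the 512 byte page size used in the example are all hypothetical, chosen only to illustrate the idea.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical status codes, not taken from VMCOG. */
typedef enum { VM_OK = 0, VM_BUSERR = 1 } VmStatus;

/* Strict mode: a word or long access must sit on its natural boundary. */
static VmStatus check_aligned(uint32_t addr, unsigned size) {
    assert(size == 1 || size == 2 || size == 4);
    return (addr & (size - 1)) == 0 ? VM_OK : VM_BUSERR;
}

/* Number of pages an access touches, which is also the number of
   working-set presence checks it needs. page_size must be a power
   of two and at least 4, so an aligned access always returns 1. */
static unsigned pages_touched(uint32_t addr, unsigned size, uint32_t page_size) {
    assert(size == 1 || size == 2 || size == 4);
    assert(page_size >= 4 && (page_size & (page_size - 1)) == 0);
    uint32_t first = addr / page_size;
    uint32_t last  = (addr + size - 1) / page_size;
    return last - first + 1;
}
```

With 512 byte pages, a long at 0x1fc stays inside one page, while an unaligned long at 0x1fe straddles two pages and needs both to be present.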

ImageCraft · 2010-02-06 20:25

RossH said...
Heater, Cluso ...

Having the call stack in cog appealed to me when I was writing Catalina - but it REALLY complicates things. For instance, you can probably only fit the call stack, not the stack frame (which contains all your local variables and function arguments) - otherwise it is quite possible to use all your cog stack space on one function call. But this means you actually have TWO stacks - a call stack in cog ram, and a frame stack in hub ram. The added complexity just didn't seem worth the potential savings to me.

Ross.

Don't want to hijack the thread, which BTW, is pretty awesome in some sense - but where do you guys find the time?!!!

In any case, Ross, ICC in fact uses COG stack for function calls and locals that can fit into registers. ICC has a sophisticated register allocator so the stack pressure is not the bottleneck. This significantly increases the performance of ICC generated programs.

The lack of total memory space is of course the key limiting factor.

// richard

heater · 2010-02-06 23:14

Richard: "Where do we find the time?" Recently I don't, but today the boss did not need me and the wife was out with friends, so it's been an intense day tackling the problem at hand. That might be it for a while now though.

This whole "using GCC on the Prop" idea is awesome in the sense that anyone would be crazy enough to attempt it :)

I just can't escape the idea that a VM that works with byte codes must be a good fit for the existing external memory solutions that Cluso, Bill and others have worked so hard on perfecting. I really don't have the skills to take on any serious compiler development so making use of the ZPU architecture and using a ready made GCC is the best I can manage.

ImageCraft and Catalina can rest assured that whatever comes out of this is never going to challenge them in the performance stakes. Although it looks like, with the help of Bill's memory caching setup, it might not be totally shamed.

By the way, seems to me one way to get speed out of CogZ would be to have only code in external memory. Put the stack and data in HUB and use Bill's cache on the code. As the code is read only, the cache never has to write anything back and then we are flying!

Actually I think that is the memory model I would like to implement first.
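Here is a minimal sketch of why a read-only code cache is simple: on a miss the old line is just overwritten, with no write-back. This is a generic direct-mapped cache written for illustration, not Bill's VMCOG design; the line size, line count and all names (`CodeCache`, `cache_init`, `fetch_op`) are assumptions, and a plain array stands in for the external memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { LINE_SIZE = 32, NUM_LINES = 8 };   /* arbitrary sizes for illustration */

typedef struct {
    const uint8_t *ext;       /* stands in for external RAM or SPI flash */
    uint32_t ext_size;
    uint8_t  data[NUM_LINES][LINE_SIZE];
    uint32_t tag[NUM_LINES];  /* which external line each slot holds */
    uint8_t  valid[NUM_LINES];
    unsigned hits, misses;
} CodeCache;

static void cache_init(CodeCache *c, const uint8_t *ext, uint32_t ext_size) {
    assert(ext_size % LINE_SIZE == 0);
    memset(c, 0, sizeof *c);
    c->ext = ext;
    c->ext_size = ext_size;
}

/* Fetch one opcode byte through the cache. */
static uint8_t fetch_op(CodeCache *c, uint32_t pc) {
    assert(pc < c->ext_size);
    uint32_t line = pc / LINE_SIZE;
    uint32_t slot = line % NUM_LINES;     /* direct mapped */
    if (!c->valid[slot] || c->tag[slot] != line) {
        /* Miss: refill the slot. Code is read only, so the old
           contents are simply dropped and nothing is written back. */
        memcpy(c->data[slot], c->ext + line * LINE_SIZE, LINE_SIZE);
        c->tag[slot] = line;
        c->valid[slot] = 1;
        c->misses++;
    } else {
        c->hits++;
    }
    return c->data[slot][pc % LINE_SIZE];
}
```

Running straight-line code through `fetch_op` costs one external line read per 32 opcodes in this model. A real implementation would have to weigh line size against the cost of each external transfer.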

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

RossH · 2010-02-07 01:50

heater,

Interesting - with no optimizer, Catalina generates a constant 26 instructions for 'go_deep' (i.e. 108 bytes). I don't know about ICC, but I would expect their code generator to be slightly more efficient (and hence generate slightly smaller code).

If I understand the architecture correctly, what you're saying is that the ZPU compiler may generate anywhere from 16 to 87 instructions for 'go_deep' (i.e. 16 to 87 bytes) - depending on how many IM's it takes to represent the addresses at link time. That's pretty cool - I'll reduce my code size estimates from 1/2 to 1/3 the size of C compiled PASM for real-world programs - i.e. a mix of many 1*IM cases, some 2*IM cases, and the occasional 3*IM and 4*IM case (for globals).

But I'm also going to increase my estimates for execution time. Excluding the initial instruction fetch itself (which takes the same time for bytes as for longs), Zog has the added complexity of handling the multiple IM cases (which will take multiple instruction fetch cycles, unless you can somehow arrange to pre-fetch longs into a cache and then decode that as bytes - ugh!). Then there is the fact that because you have no registers, all operations (like add) take place on the stack, so most of your Zog instructions need to access hub ram multiple times during execution. Excluding instruction fetches, with local registers you can 'go_deep' with only about 5 or 6 hub operations, but Zog code will require closer to 20. I think you'll be doing well to get within a factor of 4 of the speed of C compiled direct to PASM.

Ross.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina

Cluso99 · 2010-02-07 05:49

heater: Once it is running (even in spin) we can see where the time is spent and speed it up from there. Better optimisations can be done at the backend even if it takes longer.

heater · 2010-02-07 08:37

RossH, You understand correctly. But there are two optimizations going on here that can be applied independently.

One is the C compiler itself performing normal optimizations to remove redundant instructions etc. This shows up in the compiled object files. It does not change the number of bytes used for the IM instructions. They are all 5 bytes long as that is how many you need to fill a 32 bit value seven bits at a time.

The IM ops are optimized, shrunk, at link time when the "relax" option is given to the linker.
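As a sketch of the size arithmetic behind this, the C below works out the fewest IM bytes a given value needs, which is what the relaxed link ends up with. It assumes the first IM sign-extends its 7 bits and later IMs each add 7 more, as the disassembly in this thread suggests; `im_count` and `encode_im` are made-up names and this is not the linker's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest number of IM bytes (1..5) whose sign-extended payload
   of 7 * n bits can hold v. Five always suffices: 35 bits >= 32. */
static int im_count(int32_t v) {
    for (int n = 1; n < 5; n++) {
        int64_t limit = (int64_t)1 << (7 * n - 1);
        if (v >= -limit && v < limit) return n;
    }
    return 5;
}

/* Write the shortest IM chain for v into out[0..4], most significant
   7-bit group first. Returns the number of bytes written. */
static int encode_im(int32_t v, uint8_t out[5]) {
    assert(out != 0);
    int n = im_count(v);
    uint64_t bits = (uint64_t)(int64_t)v;   /* sign-extended to 64 bits */
    for (int i = 0; i < n; i++) {
        int shift = 7 * (n - 1 - i);
        out[i] = (uint8_t)(0x80u | ((bits >> shift) & 0x7fu));
    }
    return n;
}
```

So 12 fits in one byte (0x8c), anything from -8192 to 8191 fits in two, and only values outside roughly plus or minus 134 million need all five.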

Let's look at the first instruction of go_deep. As assembler output from the compiler it looks like:

go_deep:
        im _memreg+12

Where the IM instruction is a pseudo op representing five IMs.

If we compile to an object file and disassemble it with objdump we have:

00000000 <go_deep>:
   0:   00              breakpoint
   1:   00              breakpoint
   2:   00              breakpoint
   3:   00              breakpoint
   4:   00              breakpoint

Here the compiler has reserved space for enough IMs to allow for a 32 bit immediate load.

Now we link the object into a complete executable:

000006c8 <go_deep>:
     6c8:       0b              nop
     6c9:       0b              nop
     6ca:       0b              nop
     6cb:       0b              nop
     6cc:       8c              im 12

Here the linker has plugged in the required immediate value. As it fits in 7 bits, only one IM is required and the rest of the space is filled with NOPs.

Now we do the link again with the magic "relax" option:

00000564 <go_deep>:
 564:   8c              im 12

Here all the redundant NOPs have been removed.

Reducing the complete go_deep to 16 instructions requires use of the "relax" linker option and the -Os optimization on the compiler.

You are right about the overheads of being a stack based machine with no registers.

You are not quite right about "the initial instruction fetch itself (which takes the same time for bytes as for longs)" at least not when executing code from external memory.

This is where the ZPU has a chance to shine. For external memory having byte wide instructions is a good fit so let's put the code out in external memory, even in a serial SPI FLASH. Let's keep the stack and data in the HUB.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

RossH · 2010-02-07 09:30

heater,

Yes, I understand this particular 'go_deep' function will only ever need 16 bytes - I was generalizing to a more typical case where a function may refer to globals (which may always require 32 bits), and/or have many local variables (some of which may be structs or arrays occupying many bytes). In such cases, the number of times the IM's will occupy only one byte would be substantially reduced, and your code size correspondingly increased.

I agree about XMM - I was only considering the LMM case. In XMM code you will have a definite "fetch" advantage - say 3 to one - that will help make up for the inefficiencies of having no local registers. Whether you will end up being comparable with C compiled to PASM will depend on the efficiency of the XMM hardware.

Ross.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina

Toby Seckshund · 2010-02-07 10:23

Heater

Would 16 bit mem access help, or is 8 bit carved in stone?

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Style and grace : Nil point

heater · 2010-02-07 10:48

Toby, I'm sure 16 bit wide access would help for lots of things, even the ZPU. There's no reason it has to be carved in stone. If you have a DracBlade style external memory with a 16 bit or more address bus and latches, why not multiplex it with 16 bit data access?

Anyway the game here is to see if ZPU is a good match for 8 bit data access on all the existing external memory solutions or even for serial access devices like SPI FLASH. Then we can have big code and save some valuable pins hopefully with a performance that is still useful.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
