Zog - A ZPU processor core for the Prop + GNU C, C++ and FORTRAN.Now replaces S

Comments

  • lonesock Posts: 878
    edited February 2010
    Cluso99 said...
    ZOG does not have a ring to it. BTW neither does ZPUCog. And I agree it cannot be ZyCog either. :-(
    How about Z_Cog, SoftCog (since it is sort of a soft cog), ZPCog. Nothing sounds as good as ZiCog & MoCog.
    CogZ or Cog-Z?

    Jonathan

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    lonesock
    Piranha are people too.
    Free time status: see my avatar [8^)
    F32 - fast & concise floating point: OBEX, Thread
    Unrelated to the prop: KISSlicer
  • Bill Henning Posts: 6,445
    edited February 2010
    I like CogZ... has a nice ring to it.
    lonesock said...
    Cluso99 said...
    ZOG does not have a ring to it. BTW neither does ZPUCog. And I agree it cannot be ZyCog either. :-(
    How about Z_Cog, SoftCog (since it is sort of a soft cog), ZPCog. Nothing sounds as good as ZiCog & MoCog.
    CogZ or Cog-Z?

    Jonathan
    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
    Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
    Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
    Las - Large model assembler Largos - upcoming nano operating system
    www.mikronauts.com / E-mail: mikronauts _at_ gmail _dot_ com / @Mikronauts on Twitter
    RoboPi: The most advanced Robot controller for the Raspberry Pi (Propeller based)
  • jazzed Posts: 11,802
    edited February 2010
    lonesock said...
    CogZ or Cog-Z?
    Jonathan
    Much better than ZPU ... ZeePoooh! yuck! :)
  • heater Posts: 3,370
    edited February 2010
    Bill: A Spin version of ZPU is definitely in the works but first I'll get the C version running so that we know we understand the ops correctly. Having a debugger would be excellent.

    Lonesock & Bill : CogZ it is then. I'll update the first post and thread title.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    For me, the past is not over yet.
  • RossH Posts: 4,057
    edited February 2010
    @heater,

    Re: CogZ vs Catalina & Imagecraft ...

    For LMM programs CogZ won't even be in the race, since a LONG fetch from hub is the same as a BYTE fetch from hub - and once fetched, these compilers execute instructions at PASM speed. But CogZ may be competitive for XMM programs, where the benefit of fetching a byte vs a long can be significant. It will depend a bit on the hardware.

    However, I think you may have found a good solution for "big SPIN".

    I'll watch this thread with interest.

    Ross.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Catalina - a FREE C compiler for the Propeller - see Catalina
    Catalina - a FREE ANSI C compiler for the Propeller.
    Download it from http://catalina-c.sourceforge.net/
  • heater Posts: 3,370
    edited February 2010
    RossH. I would not dream of challenging Catalina or ImageCraft to a drag race "along the straight" in LMM.

    In that mode its only benefit might be its smaller code size, as all ops are only one byte wide. So let's see, could someone compile that FIBO I posted with Catalina and ImageCraft so we can see how the code size compares.

    The ZPU will fare better in a speed comparison with XMM. But I fear even there it will be trailing in last place, because it is a stack based machine with no registers, so there is an awful lot of memory access going on.

    Here is an example of a simple C statement and the ZPU asm GCC generates. Annotations by me:

    int a, b, c;
    int main(int argc, char* argv[])  {
            a = b + c;
            return (a);
    }
    
    Generates:
    
            im b        ; Push the address of "b" to stack
            load        ; Push b to stack top using address popped from stack
            im c        ; Push the address of "c" to stack
            load        ; Push c to stack using address popped
            add         ; Pop and add the top two stack items and push result to stack
            im a        ; Push the address of "a" to stack
            store       ; Pop the address of "a", then pop the sum, and store the sum at that address
    
    



    Now: An "IM b" can be done in one byte opcode read + 4 byte writes for the PUSH to stack. Total 5 memory accesses.
    A LOAD or STORE is one byte read for the op + two times PUSH/POP at 4 bytes each + a memory read of 4 bytes. Total 13 accesses.
    The ADD is one byte for the op + two POPS and a PUSH for a total of 13 accesses.

    I make that a grand total of 67 byte accesses to external RAM to add two 32 bit numbers !!!

  • Mike Green Posts: 22,459
    edited February 2010
    Is there a particular reason why the stack has to be in XMM as opposed to Hub RAM?

    If there needs to be a single uniform address space, how about placing Hub RAM at one end of the address space and pointing the stack pointer there.

    Post Edited (Mike Green) : 2/6/2010 6:29:19 AM GMT
  • heater Posts: 3,370
    edited February 2010
    A very good question Mike and one I have already pondered.

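    I suspect it would be quite possible to keep the stack in HUB. That would save us 48 of the 67 external memory accesses in the above example (12 for the three IM pushes, 24 for the LOAD/STORE pushes and pops, 12 for the ADD). Cutting the external memory thrashing by more than two thirds!!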

    In the general case of course it is quite possible for ZPU to do LOAD and STORE to somewhere within the stack in which case a stack in HUB would fail unless we did some checking of the LOAD/STORE addresses every time and sorted things out accordingly.

    I suspect that normal everyday GCC generated code will never do that and we could get away with it. In which case we start closing in on Catalina and ImageCraft performance for XMM.

  • RossH Posts: 4,057
    edited February 2010
    heater,

    Mike is right - your stack should always be in hub RAM. Catalina's LARGE memory mode has a flat address space, but the first $8000 addresses are the normal hub RAM addresses and the stack is always allocated in this address range.

    Time/space tradeoffs always fascinate me. Your addresses will presumably be 32 bits, so the code will not be 1/4 the size of Catalina's - but it could be 1/2 the size. Whether it is worth it depends on how much speed you lose. I'll be interested to see the results.

    Ross.

  • heater Posts: 3,370
    edited February 2010
    Mike, I posted as you edited your post. Yes, I think that's what I meant by "checking" and "sorting" the LOAD/STORE access addresses. That checking takes a bit of PASM but the gains are still worth it.

    I reckon the first ZPU implementation will be in Spin with no XMM, so the stack will already be in the right place, in HUB. Then we can experiment with moving the code and then the data to XMM.


    Post Edited (heater) : 2/6/2010 9:46:21 AM GMT
  • jazzed Posts: 11,802
    edited February 2010
    BTW: AFAIK there is no ImageCraft XMM solution today. I got something working but lost interest after fighting the LMM kernel quirks for so long.
  • Cluso99 Posts: 12,737
    edited February 2010
    heater: since the ZPU is only tiny, I wonder if there would be space in the cog for the stack? I know it is a tall ask, but it could speed things significantly if it could fit.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Links to other interesting threads:

    · Home of the MultiBladeProps: TriBlade,·RamBlade,·SixBlade, website
    · Single Board Computer:·3 Propeller ICs·and a·TriBladeProp board (ZiCog Z80 Emulator)
    · Prop Tools under Development or Completed (Index)
    · Emulators: CPUs Z80 etc; Micros Altair etc;· Terminals·VT100 etc; (Index) ZiCog (Z80) , MoCog (6809)·
    · Prop OS: SphinxOS·, PropDos , PropCmd··· Search the Propeller forums·(uses advanced Google search)
    My cruising website is: ·www.bluemagic.biz·· MultiBlade Props: www.cluso.bluemagic.biz
    My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBlade
    Prop OS (also see Sphinx, PropDos, PropCmd, Spinix)
    Website: www.clusos.com
    Prop Tools (Index) , Emulators (Index) , ZiCog (Z80)
  • heater Posts: 3,370
    edited February 2010
    Blimey, the first version of CogZ does not exist yet and we already have a dozen optimizations on the table :)

    The Stack in Cog may also be possible. Whilst the base ZPU is tiny it's probably worth more to fill up the COG with PASM implementations of the ZPU ops that would otherwise be emulated in ZPU ASM.

  • RossH Posts: 4,057
    edited February 2010
    Heater, Cluso ...

    Having the call stack in cog appealed to me when I was writing Catalina - but it REALLY complicates things. For instance, you can probably only fit the call stack, not the stack frame (which contains all your local variables and function arguments) - otherwise it is quite possible to use all your cog stack space on one function call. But this means you actually have TWO stacks - a call stack in cog RAM, and a frame stack in hub RAM. The added complexity just didn't seem worth the potential savings to me.

    Ross.

  • heater Posts: 3,370
    edited February 2010
    I really don't want to get into the complication of separate call and data stacks. I'd rather make the restriction that if you want fast code then you don't put lots of local data in your functions.

    So what we need is a flat memory space presented to ZPU via an interface that maps some of it to HUB and some of it to external RAM. So if you use a lot of stack you will find the code slowing down dramatically.

    Problem: I've just been looking at the Java ZPU simulator and it implements a stack that grows upwards! This seems to complicate that idea somewhat.

    RossH: "Your addresses will presumably be 32 bits, so the code will not be 1/4 the size of Catalina's"

    ZPU has an interesting feature. In the code snippet I posted above you will see things like "im b" where it is loading the address of "b" from immediate data. Thing is that "im" is a one byte instruction, the top bit "1" indicates "im" and the other seven bits are the immediate data. So small addresses can be loaded with a single one byte "im". For larger addresses just chain "im"s together and it will load another 7 bits and another 7 bits etc. In this way for small data sets the code can be quite small.

  • Bill Henning Posts: 6,445
    edited February 2010
    Interesting...

    Can you compile the following:

    int go_deep(int n) {
    int b;
    b += go_deep(n-1);
    return &b;
    }

    It is a nonsense function, but it should stop the optimizer from mucking too much with the code.

    I am interested in how it handles references to addresses of local variables in C functions. Reason: It might be difficult to put the stack into the hub, depending on how it handles it.

  • heater Posts: 3,370
    edited February 2010
    Bill, that turned out to be a deep question :)

    This is long so please persevere.

    If you compile a single module with -S to just get an assembler listing you get this:

    go_deep:
        im _memreg+12
        load
        pushsp
        im _memreg+12
        store
        im -2
        pushspadd
        popsp
        im _memreg+12
        load
        im 8
        add
        load
        im -1
        add
        loadsp 0
        storesp 8
        storesp 8
        impcrel (go_deep)
        callpcrel
        im _memreg+0
        load
        im _memreg+12
        load
        im -4
        add
        load
        addsp 4
        im _memreg+12
        load
        im -4
        add
        store
        im _memreg+12
        load
        im -4
        add
        loadsp 0
        im _memreg+0
        store
        storesp 4
        storesp 8
        im 4
        pushspadd
        popsp
        im _memreg+12
        store
        poppc
        .size    go_deep, .-go_deep
    
    



    Problem is it contains things like "_memreg", "impcrel" and "callpcrel" which are not ZPU ops and don't mean anything to me yet. Furthermore, there is not a one-to-one correspondence between the asm instructions there and the actual op bytes in the finished program.

    For example disassembling the compiled object with objdump gives this:

    
    Output from objdump of a single compiled module:
    
    00000000 <go_deep>:
       0:    00              breakpoint
       1:    00              breakpoint
       2:    00              breakpoint
       3:    00              breakpoint
       4:    00              breakpoint
       5:    08              load
       6:    02              pushsp
       7:    00              breakpoint
       8:    00              breakpoint
       9:    00              breakpoint
       a:    00              breakpoint
       b:    00              breakpoint
       c:    0c              store
       d:    fe              im -2
       e:    3d              pushspadd
       f:    0d              popsp
      10:    00              breakpoint
      11:    00              breakpoint
      12:    00              breakpoint
      13:    00              breakpoint
      14:    00              breakpoint
      15:    08              load
      16:    88              im 8
      17:    05              add
      18:    08              load
      19:    ff              im -1
      1a:    05              add
      1b:    70              loadsp 0
      1c:    52              storesp 8
      1d:    52              storesp 8
      1e:    00              breakpoint
      1f:    00              breakpoint
      20:    00              breakpoint
      21:    00              breakpoint
      22:    00              breakpoint
      23:    3f              callpcrel
      24:    00              breakpoint
      25:    00              breakpoint
      26:    00              breakpoint
      27:    00              breakpoint
      28:    00              breakpoint
      29:    08              load
      2a:    00              breakpoint
      2b:    00              breakpoint
      2c:    00              breakpoint
      2d:    00              breakpoint
      2e:    00              breakpoint
      2f:    08              load
      30:    fc              im -4
      31:    05              add
      32:    08              load
      33:    11              addsp 4
      34:    00              breakpoint
      35:    00              breakpoint
      36:    00              breakpoint
      37:    00              breakpoint
      38:    00              breakpoint
      39:    08              load
      3a:    fc              im -4
      3b:    05              add
      3c:    0c              store
      3d:    00              breakpoint
      3e:    00              breakpoint
      3f:    00              breakpoint
      40:    00              breakpoint
      41:    00              breakpoint
      42:    08              load
      43:    fc              im -4
      44:    05              add
      45:    70              loadsp 0
      46:    00              breakpoint
      47:    00              breakpoint
      48:    00              breakpoint
      49:    00              breakpoint
      4a:    00              breakpoint
      4b:    0c              store
      4c:    51              storesp 4
      4d:    52              storesp 8
      4e:    84              im 4
      4f:    3d              pushspadd
      50:    0d              popsp
      51:    00              breakpoint
      52:    00              breakpoint
      53:    00              breakpoint
      54:    00              breakpoint
      55:    00              breakpoint
      56:    0c              store
      57:    04              poppc
    
    



    Whoa! Why is that so big? And what are all those breakpoints doing in there?

    Turns out those breakpoints are placeholders where the linker is going to put IM ops when it links the entire program.
    And the linker can get rid of many of them.

    So we move on and link the compiled program. The linker fills in the breakpoint slots with IM ops and if we use the -relax option it removes any redundant ones.

    
    Output from objdump on linked program no optimization:
    
    00000564 <go_deep>:
     564:    8c              im 12
     565:    08              load
     566:    02              pushsp
     567:    8c              im 12
     568:    0c              store
     569:    fe              im -2
     56a:    3d              pushspadd
     56b:    0d              popsp
     56c:    8c              im 12
     56d:    08              load
     56e:    88              im 8
     56f:    05              add
     570:    08              load
     571:    ff              im -1
     572:    05              add
     573:    70              loadsp 0
     574:    52              storesp 8
     575:    52              storesp 8
     576:    ed              im -19
     577:    3f              callpcrel
     578:    80              im 0
     579:    08              load
     57a:    8c              im 12
     57b:    08              load
     57c:    fc              im -4
     57d:    05              add
     57e:    08              load
     57f:    11              addsp 4
     580:    8c              im 12
     581:    08              load
     582:    fc              im -4
     583:    05              add
     584:    0c              store
     585:    8c              im 12
     586:    08              load
     587:    fc              im -4
     588:    05              add
     589:    70              loadsp 0
     58a:    80              im 0
     58b:    0c              store
     58c:    51              storesp 4
     58d:    52              storesp 8
     58e:    84              im 4
     58f:    3d              pushspadd
     590:    0d              popsp
     591:    8c              im 12
     592:    0c              store
     593:    04              poppc
    
    



    Hooray! That looks more like what we want. Moving on again, we optimize for space with -Os:

    Output from objdump on linked program optimized for size:
    
    00000564 <go_deep>:
    
     564:    ff              im -1
     565:    3d              pushspadd
     566:    0d              popsp
     567:    ff              im -1
     568:    14              addsp 16
     569:    51              storesp 4
     56a:    f9              im -7
     56b:    3f              callpcrel
     56c:    82              im 2
     56d:    3d              pushspadd
     56e:    80              im 0
     56f:    0c              store
     570:    83              im 3
     571:    3d              pushspadd
     572:    0d              popsp
     573:    04              poppc
    
    



    Job done.

  • heater Posts: 3,370
    edited February 2010
    By the way, if you compile and link the program normally (no -relax option for the linker), the linker replaces those breakpoints with as many IM ops as it needs to get the immediate data in, then replaces any spare breakpoints with NOPs. So the code stays huge.

  • Bill Henning Posts: 6,445
    edited February 2010
    That is a pretty good optimizer!

    pushspadd pretty much has to be implemented in PASM; it will help a LOT.

    I can see that long reads/writes are going to be very common, so I will implement separate messages for BYTE/WORD/LONG, which will remove a bunch of hub accesses to 'vmbytes' - however I will do this after the first version works.

    The Spin API won't have to change.

    I think your Spin implementation of CogZ will be a great test for my Spin API - and VMCOG :)

    For CogZ I think a delayed write page replacement policy will be far better than write-through.

  • heater Posts: 3,370
    edited February 2010
    So far...

    What I've done is to create a C version of the ZPU simulator by cutting all the emulator methods out of the ZyLin ZPU simulator, which is in Java, and throwing away any extraneous junk. This way we get a good handle on the required implementation without having to worry about correctly interpreting the documentation. For example, the docs don't talk about which way the stack grows or which endianness the words and longs have, etc.

    Next step will be to get some ZPU code running under it. Just now it has executed its first instructions, NOP and BREAKPOINT.

    The current challenge is figuring out how to get GCC to compile stuff nicely and then how to extract a runnable binary image from the resulting executable. None of that process is documented very well either.

    If I can get that all working then it's time to simplify the C simulator and make it look more like Spin so I can get on with a Spin version. Then the PASM version.

  • Bill Henning Posts: 6,445
    edited February 2010
    Sounds like a plan!

    I've gone ahead and re-arranged VMCOG for separate READVMB/W/L and WRITEVMB/W/L messages, updated the Spin API wrappers etc.

    My plan for testing is:

    - READB/WRITEB

    Then once those work perfectly

    - ALIGNED READW/L and WRITEW/L, cause a BUSERR to happen (so we can find software that causes unaligned access); I may add a STATUS long to the mailbox

    Once those work perfectly

    - Allow unaligned word/long access, and throw in a bit of optimization so that if the word/long is aligned, we only check for presence in the working set once.
  • ImageCraft Posts: 348
    edited February 2010
    RossH said...
    Heater, Cluso ...

    Having the call stack in cog appealed to me when I was writing Catalina - but it REALLY complicates things. For instance, you can probably only fit the call stack, not the stack frame (which contains all your local variables and function arguments) - otherwise it is quite possible to use all your cog stack space on one function call. But this means you actually have TWO stacks - a call stack in cog ram, and a frame stack in hub ram. The added complexity just didn't seem worth the potential savings it to me.

    Ross.

    Don't want to hijack the thread, which BTW, is pretty awesome in some sense - but where do you guys find the time?!!!

    In any case, Ross, ICC in fact uses COG stack for function calls and locals that can fit into registers. ICC has a sophisticated register allocator so the stack pressure is not the bottleneck. This significantly increases the performance of ICC generated programs.

    The lack of total memory space is of course the key limiting factor.

    // richard
  • heater Posts: 3,370
    edited February 2010
    Richard: "Where do we find the time?" Recently I don't, but today the boss did not need me and the wife was out with friends, so it's been an intense day tackling the problem at hand. That might be it for a while now though.

    This whole "using GCC on the Prop" idea is awesome in the sense that anyone would be crazy enough to attempt it :)

    I just can't escape the idea that a VM that works with byte codes must be a good fit for the existing external memory solutions that Cluso, Bill and others have worked so hard on perfecting. I really don't have the skills to take on any serious compiler development so making use of the ZPU architecture and using a ready made GCC is the best I can manage.

    ImageCraft and Catalina can rest assured that whatever comes out of this is never going to challenge them in the performance stakes. Although it looks like, with the help of Bill's memory caching setup, it might not be totally shamed.

    By the way, seems to me one way to get speed out of CogZ would be to have only code in external memory. Put the stack and data in HUB and use Bill's cache on the code. As the code is read only the cache never has to write anything back and then we are flying!

    Actually I think that is the memory model I would like to implement first.

  • RossH Posts: 4,057
    edited February 2010
    heater,

    Interesting - with no optimizer, Catalina generates a constant 26 instructions for 'go_deep' (i.e. 108 bytes). I don't know about ICC, but I would expect their code generator to be slightly more efficient (and hence generate slightly smaller code).

    If I understand the architecture correctly, what you're saying is that the ZPU compiler may generate anywhere from 16 to 87 instructions for 'go_deep' (i.e. 16 to 87 bytes) - depending on how many IM's it takes to represent the addresses at link time. That's pretty cool - I'll reduce my code size estimates from 1/2 to 1/3 the size of C compiled PASM for real-world programs - i.e. a mix of many 1*IM cases, some 2*IM cases, and the occasional 3*IM and 4*IM case (for globals).

    But I'm also going to increase my estimates for execution time. Excluding the initial instruction fetch itself (which takes the same time for bytes as for longs), Zog has the added complexity of handling the multiple IM cases (which will take multiple instruction fetch cycles, unless you can somehow arrange to pre-fetch longs into a cache and then decode that as bytes - ugh!). Then there is the fact that because you have no registers, all operations (like add) take place on the stack - so most of your Zog instructions need to access hub RAM multiple times during execution. Excluding instruction fetches, with local registers you can 'go_deep' with only about 5 or 6 hub operations - but Zog code will require closer to 20. I think you'll be doing well to get within a factor of 4 of the speed of C compiled direct to PASM.

    Ross.

  • Cluso99 Posts: 12,737
    edited February 2010
    heater: Once it is running (even in spin) we can see where the time is spent and speed it up from there. Better optimisations can be done at the backend even if it takes longer.

  • heater Posts: 3,370
    edited February 2010
    RossH, You understand correctly. But there are two optimizations going on here that can be applied independently.

    One is the C compiler itself performing normal optimizations to remove redundant instructions etc. This shows up in the compiled object files. It does not change the number of bytes used for the IM instructions. They are all 5 bytes long as that is how many you need to fill a 32 bit value seven bits at a time.

    The IM ops are optimized, shrunk, at link time when the "relax" option is given to the linker.

    Let's look at the first instruction of go_deep. As assembler output from the compiler it looks like:

    go_deep:
            im _memreg+12
    
    



    Where the IM instruction is a pseudo op representing five IMs.

    If we compile to an object file and disassemble it with objdump we have:

    00000000 <go_deep>:
       0:   00              breakpoint
       1:   00              breakpoint
       2:   00              breakpoint
       3:   00              breakpoint
       4:   00              breakpoint
    
    



    Here the compiler has reserved space for enough IMs to allow for a 32 bit immediate load.

    Now we link the object into a complete executable:

    000006c8 <go_deep>:
         6c8:       0b              nop
         6c9:       0b              nop
         6ca:       0b              nop
         6cb:       0b              nop
         6cc:       8c              im 12
    
    



    Here the linker has plugged in the required immediate value. As it fits within 7 bits only one IM is required; the rest of the space is filled with NOPs.

    Now we do the link again with the magic "relax" option:

    00000564 <go_deep>:
     564:   8c              im 12
    
    



    Here all the redundant NOPs have been removed.

    Reducing the complete go_deep to 16 instructions requires use of the "relax" linker option and the -Os optimization on the compiler.


    You are right about the overheads of being a stack based machine with no registers.

    You are not quite right about "the initial instruction fetch itself (which takes the same time for bytes as for longs)" at least not when executing code from external memory.

    This is where the ZPU has a chance to shine. For external memory having byte wide instructions is a good fit so let's put the code out in external memory, even in a serial SPI FLASH. Let's keep the stack and data in the HUB.

  • RossH Posts: 4,057
    edited February 2010
    heater,

    Yes, I understand this particular 'go_deep' function will only ever need 16 bytes - I was generalizing to a more typical case where a function may refer to globals (which may always require 32 bits), and/or have many local variables (some of which may be structs or arrays occupying many bytes). In such cases, the number of times the IM's will occupy only one byte would be substantially reduced, and your code size correspondingly increased.

    I agree about XMM - I was only considering the LMM case. In XMM code you will have a definite "fetch" advantage - say 3 to one - that will help make up for the inefficiencies of having no local registers. Whether you will end up being comparable with C compiled to PASM will depend on the efficiency of the XMM hardware.

    Ross.

  • Toby Seckshund Posts: 2,010
    edited February 2010
    Heater

    Would 16 bit mem access help, or is 8 bit carved in stone?

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Style and grace : Nil point
  • heater Posts: 3,370
    edited February 2010
    Toby, I'm sure 16 bit wide access would help for lots of things, even the ZPU. Nothing is carved in stone: if you have a DracBlade style external memory with a 16 bit or wider address bus and latches, why not multiplex it with 16 bit data access?

    Anyway the game here is to see if ZPU is a good match for 8 bit data access on all the existing external memory solutions or even for serial access devices like SPI FLASH. Then we can have big code and save some valuable pins hopefully with a performance that is still useful.
