Zog - A ZPU processor core for the Prop + GNU C, C++ and FORTRAN.Now replaces S

Bill Henning · 2011-05-24 07:12

Agreed!

Also see post #252 in http://forums.parallax.com/showthread.php?131477-GCC-Eclipse-and-Propeller-2-seeking-developers/page13 ... using the CLUT as the return stack and expression evaluation stack for ZOG (and Spin) should result in quite a performance boost, as it would avoid hub accesses not only for call/return, but also for every push/pop used in expression evaluation!

Note that stack frames for arguments and local variables still need to be kept in the hub, but the new indexed hub rd/wr instructions (once documented... please prepare the dental chair and your dremel, and assist Kye with the information extraction) will help greatly there.

jazzed wrote: »

Looks like a great day for ZOG. Thanks Andrey.

jazzed · 2011-05-24 08:20

Bill Henning wrote: »

Agreed!

Such enthusiasm! You can probably stop waving your hands now. I'm sure everyone has noticed.

Will we need Zog on Propeller 2 when a "native" GCC tool-chain can be used there?

Bill Henning · 2011-05-24 08:42

jazzed wrote: »

Such enthusiasm! You can probably stop waving your hands now. I'm sure everyone has noticed.

LOL... good, my arms were getting tired...

jazzed wrote: »

Will we need Zog on Propeller 2 when a "native" GCC tool-chain can be used there?

ZOG will be very useful for memory constrained applications.

I expect LMM2 gcc generated code to be much faster than ZOG byte codes (10x-20x), however ZOG will enjoy a 2x-6x code size advantage even with the hub addressing modes.

They will make a very useful combination - somewhat akin to the current SPIN/PASM pairing, I could see a ZOG/LMM2 pairing where the large "business logic" code is compiled for ZOG, and the fast code compiled for LMM2. For the remaining cases that need the last possible drop of speed, optimized hand written PASM2 can be used for potential applications such as soft 10mbps ethernet, 12mbps USB, 3D graphics engines etc.

p.s.

The race I anticipate between Catalina and GCC on the Prop2 will improve both.

lonesock · 2011-05-24 08:53

Given that the simplest LMM loop is pretty small, it might be worth having a LMM interpreter in Zog, if it fits, for the equivalent of inline assembly. Would this be doable?

Jonathan

Bill Henning · 2011-05-24 08:56

That should be pretty easy... and I also want to see that in the new Spin VM.

Based on the information from the old thread, with a butchered syntax (I don't know the new opcode style) it would look something like this:

next    rdlong  inst,(pc)++     ' one access per 8 clock cycles; six single-cycle instructions fit between hub windows
        nop                     ' delay slot
inst    nop                     ' space for instruction
        nop                     ' wasted slot
        nop                     ' wasted slot
        nop                     ' wasted slot
        jmp     #next

About 30 longs of as few primitives as possible follow

ie FJMP, FCALL, FRET, IMM32 is the minimum subset I'd recommend. No space for FCACHE, unless VM's use it internally too.

Currently I am thinking of uses for the delay and wasted slots... so far profiling, breakpoints and other debugging assists come readily to mind.

My preliminary LMM2 vm had some primitives defined, but I am waiting for the "official" RDxxxx extensions to be documented for me before putting significantly more work into the microcode. I do plan an option to use the CLUT as a hardware return address only stack for LMM2.

I have more ideas for more hardware assist to make LMM2 faster, but every one would kill me if I asked for them, as they would delay Prop2, and probably require too much silicon (thus reducing the ram). Fortunately with the new design technology Parallax is using, it will be possible to keep updating Prop2 whenever financially feasible/desirable.

Heater. · 2011-05-24 11:30

Bill Henning,

...using the CLUT as the return stack and expression evaluation stack for ZOG (and Spin) should result in quite a performance boost, as it would avoid hub accesses not only for call/return, but also for every push/pop used in expression evaluation!

True but that's not going to happen unless someone rewrites the zpu-gcc C compiler to use the stacks that way. Separating out the return stack and/or the expression stack and/or automatic variable stack is not something done so easily in the VM (Is it?) Surely the intentions of the compiler are unknown to the VM.

Lonesock,

...it might be worth having a LMM interpreter in Zog, if it fits, for the equivalent of inline assembly. Would this be doable?

Interesting idea. Doable? No idea:) Last time I checked there was not much space left on the Zog COG.

Jazzed,

Will we need Zog on Propeller 2 when a "native" GCC tool-chain can be used there?

You don't ask if we need a 1000ft space alien. He just is:) I'm still hoping that one day Zog is going to outpace kernels relying on 32 bit PASM instructions when code is fetched from external memory whilst data/stack is in HUB.

Anyway Zog is going to enjoy the more spacious accommodation provided by Prop II.

Bill Henning · 2011-05-24 11:45

Hi Heater,

Based on my fading memory of reading the ZOG VM before I suggested the TOS/NOS etc optimization, my gut feeling is that putting the expression evaluation stack into the CLUT will be easy, given the intentions of the compiler need not be known for this optimization.

I don't think the return stack would be too tough either as long as the return address is currently placed on the stack with a unique call instruction, and popped with a return instruction.

I would not even think of storing parameters and locals there for C/C++/Java code (imagine endless horrible debugging issues due to quickly running out of stack...BRRRRR...) although as David pointed out, that should work for Basic where deep function trees and many parameters/locals are not common.

lonesock · 2011-05-24 11:46

Sorry for the dumb question, but I'm having trouble finding the latest version...is there a repo or something? Is there a link on the 1st post that I missed? (There is a *ton* of information in this thread!!)

Jonathan

Heater. · 2011-05-24 11:54

lonesock,

On page 26 of this thread there is attached zog_v1.6. I'm pretty sure that's the last version I put out. There was a 1.6 with no SD card support when using VMCOG which I made for Bill at some point but I'm not sure what happened to that. I would have attached these to the first post but I can't edit posts under my old "heater" user name.

More recently there are Dave Betz's releases, to be found here http://forums.parallax.com/showthread.php?129672-ZOG-GCC-C-C-Project-in-Google-Code-%28propeller-zpu-vm%29 which are much more sophisticated.

I think, though, that I will soon be tinkering with my old 1.6 version to try out Andrey's little endian compiler and such.

Edit: I was to release a 1.7 with your F32 support a while back. Must remember to finish that up too.

Edit: A repo would be a damn good idea.

Heater. · 2011-05-24 17:45

Bill Henning,

Based on my fading memory of reading the ZOG VM before I suggested the TOS/NOS etc optimization, my gut feeling is that putting the expression evaluation stack into the CLUT will be easy, given the intentions of the compiler need not be known for this optimization.

I don't see how it is possible. Let's forget the compiler for a moment and assume we are writing ZPU assembler by hand. Then I can always write a sequence like:

IM      AnAddress       ; Load AnAddress from an immediate value and push to stack
LOAD                    ; Pop AnAddress from stack, load value at AnAddress and push it to stack

Now, AnAddress can be anywhere in the address space. Could be in a code area or constants or data or even somewhere in the stack. If it's in the stack it could be fetching a local variable or a return address or some expression intermediate value, anything.

So there is the problem. If we try to put the return stack and/or expression evaluation stack for ZOG into the CLUT we have just broken the above instruction sequence. For example if AnAddress happened to be a return address on the CLUT stack it is not available to LOAD.

I'm pretty sure the C compiler is capable of generating such sequences.
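To make the point concrete, here is a minimal sketch in C of a toy stack machine whose stack lives in the same address space as everything else. It is not Zog's code: the names `ToyZpu`, `op_im`, `op_load` and the 64-word memory are assumptions made up for illustration, and `op_im` pushes a whole value rather than modelling the real 7-bit IM encoding.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_WORDS 64u            /* toy address space, in 32-bit words */

typedef struct {
    uint32_t mem[MEM_WORDS];     /* code, data and stack share one space */
    uint32_t sp;                 /* word index of top of stack; grows down */
} ToyZpu;

static void push(ToyZpu *z, uint32_t v) {
    assert(z->sp > 0);
    z->mem[--z->sp] = v;
}

static uint32_t pop(ToyZpu *z) {
    assert(z->sp < MEM_WORDS);
    return z->mem[z->sp++];
}

/* IM-like: push an immediate value (here used as a byte address). */
static void op_im(ToyZpu *z, uint32_t value) { push(z, value); }

/* LOAD: pop an address, push the word stored there, i.e. TOS := mem[TOS]. */
static void op_load(ToyZpu *z) {
    uint32_t addr = pop(z);
    assert(addr % 4 == 0 && addr / 4 < MEM_WORDS);
    push(z, z->mem[addr / 4]);
}

/* Push a pretend return address, then read it back with IM + LOAD. */
static uint32_t load_from_own_stack(void) {
    ToyZpu z = { .sp = MEM_WORDS };
    push(&z, 0x1234u);               /* lands in mem[MEM_WORDS - 1] */
    op_im(&z, (MEM_WORDS - 1) * 4);  /* byte address of that stack slot */
    op_load(&z);
    return pop(&z);
}
```

`load_from_own_stack()` only works because `op_load` indexes the same `mem` array that `push` writes to. If the stack were moved into a separate array (standing in for the CLUT), `op_load` would need an extra address-range check to find the value, which is the cost being argued about here.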

Or have I, as is often the case, missed a point again?

...however ZOG will enjoy a 2x-6x code size advantage [over LMM]...

Somehow that has not come to fruition yet. Not sure if Zog is generating big code or if we are still just including a lot of redundant run time junk in the binaries. One size issue concerns unsigned div and mod, for which Zog includes some long-winded functions as it does not have opcodes for them.
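For a sense of why such helpers take up space, this is roughly what a software unsigned divide looks like when there is no divide opcode. It is a generic shift-and-subtract sketch in C, not the routine actually linked into Zog binaries; `udivmod32` and `UDivResult` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t quot, rem; } UDivResult;

/* Restoring (shift-and-subtract) division, one quotient bit per iteration. */
static UDivResult udivmod32(uint32_t num, uint32_t den) {
    assert(den != 0);
    UDivResult r = { 0, 0 };
    for (int bit = 31; bit >= 0; bit--) {
        /* Bring down the next numerator bit. rem is below 2^31 before the
           shift, because at most 31 bits have been consumed so far. */
        r.rem = (r.rem << 1) | ((num >> bit) & 1u);
        if (r.rem >= den) {
            r.rem -= den;
            r.quot |= 1u << bit;
        }
    }
    return r;
}
```

Every one of the 32 iterations compiles to several shift, mask, compare and branch byte codes, so both the code size and the run time are noticeable compared with a single opcode.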

Bill Henning · 2011-05-24 19:37

Hi Heater,

By Jove, I think you are right, at least about expressions!

I forgot that LOAD essentially does TOS := @TOS, which is then reduced to RDLONG TOS,TOS

I still think the return stack would be faster in the CLUT, as that does not do an indirect hub reference... or does it?

As far as code size, building 32 bit constants would be a major culprit.

Heater. · 2011-05-25 00:33

Bill Henning,

As far as code size, building 32 bit constants would be a major culprit.

Yes that is a bit of a killer. However it's a bit clever about it. The compiler will always generate space for five IM instructions to load any 32 bit constant even if it is a small number like 1. Then the linker, when used with the "relax" option, inserts only the number of IM's required for the constant "IM 1" say. The rest of the code being moved down to occupy the now redundant 4 instruction spaces. It will even do clever things like:

IM        1
NEG

To construct large constants with fewer instructions.
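The arithmetic behind "up to five IMs" can be sketched in C. This assumes the usual ZPU encoding as I understand it: an IM opcode has its top bit set and carries 7 bits, the first IM of a run is sign-extended, and each following IM shifts the value left by 7 and ORs in its bits. The function names are made up for illustration and are not taken from zpu-gcc or the linker.

```c
#include <assert.h>
#include <stdint.h>

/* Number of consecutive IM opcodes needed for a signed 32-bit constant.
   n IMs hold a signed 7n-bit value: -2^(7n-1) .. 2^(7n-1) - 1. */
static int im_count(int32_t value) {
    int n = 1;
    int64_t v = value;
    while (v < -((int64_t)1 << (7 * n - 1)) || v >= ((int64_t)1 << (7 * n - 1)))
        n++;
    return n;
}

/* Emit the IM bytes (0x80 | 7 bits), most significant group first.
   Returns how many bytes were written to out. */
static int im_encode(int32_t value, uint8_t out[5]) {
    int n = im_count(value);
    for (int i = 0; i < n; i++) {
        int shift = 7 * (n - 1 - i);
        out[i] = (uint8_t)(0x80u | (((uint64_t)(int64_t)value >> shift) & 0x7Fu));
    }
    return n;
}

/* Rebuild the constant from a run of n IM bytes, the way a VM would. */
static int32_t im_decode(const uint8_t *code, int n) {
    assert(n >= 1 && n <= 5);
    uint32_t tos = code[0] & 0x7Fu;
    if (tos & 0x40u) tos |= 0xFFFFFF80u;      /* first IM is sign-extended */
    for (int i = 1; i < n; i++)
        tos = (tos << 7) | (code[i] & 0x7Fu); /* later IMs shift in 7 bits */
    return (int32_t)tos;
}
```

Under these assumptions `im_count(1)` is 1 and `im_count(0x7FFFFFFF)` is 5, which matches the one-versus-five instruction difference that the linker's relax pass recovers.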

Heater. · 2011-05-25 01:39

Andrey,

Just built your little-endian patched compiler on Debian here. Compiled a little test.c like yours. So far the output looks good. The entire executable is only 29 bytes now for a main function, a _premain function and crt0. Pretty damn good.

Later today I'll tweak with my ZPU simulator in C and see if we can run something with it.

Bill Henning · 2011-05-25 07:38

Hi Heater,

That's not too bad. Pity about the five instruction decode overheads for a full 32 bit long :-(

OH! (flashing lightbulb)

*if* there is room in the ZOG VM

Look-ahead to see if the next instruction is also an IM. If it is, process it without going back to the whole instruction decode loop.

If it is not, just jump into the op decode loop with the op already fetched.

This way it should speed up IM loads ~20%+ without a speed penalty on other instructions.
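In C terms the look-ahead idea might look like the sketch below. It is only a model of the control flow: the real Zog VM is PASM, and `run_im_sequence` is a hypothetical helper, not something from the Zog sources. It assumes IM opcodes are the ones with the top bit set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IS_IM(op) (((op) & 0x80u) != 0)

/* Handle the IM at code[pc] plus any IMs directly following it, without
   going back through the main dispatch. Stores the constant in *tos and
   returns the index of the first non-IM opcode (or len at end of code). */
static size_t run_im_sequence(const uint8_t *code, size_t len, size_t pc,
                              uint32_t *tos) {
    assert(pc < len && IS_IM(code[pc]));
    uint32_t v = code[pc] & 0x7Fu;
    if (v & 0x40u) v |= 0xFFFFFF80u;       /* first IM is sign-extended */
    pc++;
    while (pc < len && IS_IM(code[pc])) {  /* look ahead: still an IM? */
        v = (v << 7) | (code[pc] & 0x7Fu);
        pc++;
    }
    *tos = v;
    return pc;
}
```

The main loop would call this when it decodes an IM and then dispatch on `code[pc]` at the returned index. Whether the saving is worth the extra cog longs is exactly the open question raised above.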

I think.

Maybe.

Worth looking into anyway...

(unless you already added this earlier - I have not looked at the ZOG VM in ages)

Heater. · 2011-05-25 12:42

Bill,
You are a wicked genius, the kind of guy who keeps us mere mortals busy checking if these ideas work or not:)
First we have to check Andrey's little-endian compiler works.

Bill Henning · 2011-05-25 13:01

*blush*

Thank you.

I do have to credit my CS education, and my hobby of trying to do pencil designs of "the ultimate processor" - which means about 30 years of research on pretty much every major processor architecture - past, present and future (when published) - and trying to figure out how to do it better.

I got this bug in senior high school, after I played with a Cosmac Elf - which had a *horrible* instruction set compared to the 8085, 6502, 6800 and Z80 of the time. I looked on the 1802 and thought *I* could do better.... now I realize that back then I could not have pulled it off.

Heater. · 2011-05-25 14:01

Bill,

"CS education" Looks like that was back in the day when "CS" meant more than programming in Java:)

Now, did I ever tell you that LMM sucks?

No, I don't mean that, but the Prop architecture just does not suit high level languages like C (or anything from FORTRAN/ALGOL upwards) being compiled to "native" PASM. I start to think that the more we bend the Prop that way the more it would become an ARM or x86 or MIPS or whatever. And the less it is the Prop that we know and love. Why go down that well trodden road again, especially when the competition is already so big?

Perhaps it would be better to use the silicon of Prop II / III to implement a Spin byte code execution engine along side the PASM, or dare I say it a ZPU engine. It's designed for minimal logic usage anyway.

Sorry I've just had a beer or two and start looking at things sideways:)

Hmm...Cosmac Elf. Some how I missed that one. The 6809 was my favourite. Wonder where one can get an 1802 now.

Bill Henning · 2011-05-25 15:30

Yep, CS meant learning a LOT back in '82.

Mind you, I coasted until I hit some third, then later fourth, year CS courses.

Back then, they taught you a whole passel of languages: PL/1, Pascal, Modula2, APL, C, Lisp, Prolog, a couple of assembly languages (I took every processor / system architecture course I could find).

I was totally disgusted with freshly hatched CS graduates starting in the late nineties when hiring - oh sure, they wrote pretty commented object oriented code, but not a clue about system architecture, pointers, or god forbid, assembly language. How the heck could they think they could do embedded code?

Actually that dissatisfaction led to Morpheus, PropCade, and my new educational products; perhaps I like tilting at windmills... but I want the next generation of kids to get the same excitement we got out of hands-on immediate programming without loading 4GB of bloat.

Absolutely. LMM on Prop1 sucks performance wise.

Mind you, when I designed it, I figured that people would use it the way I intended - that is, put anything that loops (and fits into up to 256 longs) into an FCACHE block. Some initial testing showed that it could then reach a rather high percentage of pure PASM speed.

Even then, you are correct - the lack of indexed hub addressing, indexed cog addressing, and more, kills performance even then.

Fortunately enough people suggested it, and I am sure Chip recognized it himself, that Chip added the needed missing bits to Prop2 so that it *should* (famous last words) be pretty fast.

Check my posting on the GCC thread in the ParallaxSemiconductor forum, I put up a quick post about the other LMM variants I tried (cycle counting on paper!) to figure out what would be best for Prop2. And once we finally get docs on the index register / addressing modes, I should be able to finalize my current design. Dave is updating his simulator for the Prop2 instruction set, so I should be able to get it running before I can get silicon - mind you, my wife may bop me on the head if I don't finish existing projects and documentation first!

Re/ 1802 ... it would be an easy one to emulate, should fit in one cog, and be MUCH faster than the original.

I loved the 6809 as well, what a lovely clean 16 bit architecture! (I know, really they only said 8 - but give me a break, with the register structure, it really was a 16 bit processor)

Cluso99 · 2011-05-25 17:41

Oh Bill, you make me feel so old! CS courses were not available when I started in computers. All the courses were run by computer companies and I actually started in hardware before transferring to software. Anyway enough of that.

We were always trying to fit programs into too small a memory, so all sorts of ways were invented to save space. That was one reason I wrote the fast overlay loader. IMHO a smart programmer would mix LMM with overlays and we would end up with the best of both worlds (speed). Of course, I also sped up the spin interpreter by shifting the decode table to hub ram and utilised the decode table to point to 3 subroutines and set a couple of flags to improve performance.

So, in essence, heater is correct in that we have made the prop do things other faster single core cpus can do. However, I think it is fair to say we are still exploiting the prop for what it can do in the other cores. It is just a fact that most programs do still require one large program.

I have not had the time to look at Zog even though I follow the thread from time to time. It is an interesting idea. Certainly it has a following here.

Has anyone noticed there are almost 32K views!!!

Congratulations heater

BTW: Thanks Bill for getting into Chip's ear - the new instructions and CLUT will make a world of difference to the Prop II performance.

Bill Henning · 2011-05-25 17:56

Ooo... I am out old-timered!

Sorry I did not mention it was you who wrote the fast overlay loader; my old braincells failed me. I'll add it.

I will not take too much credit for suggestions to Chip, I am certain a lot of others were doing the same! Everyone was asking for indexed hub pointers, cog pointers, and I am sure some must have asked for a hardware return stack.

The credit goes to Chip for being willing to listen, and for making the suggestions better.

Ok, I will take credit for suggesting a four-instruction wide line cache using the RDQLONG hardware; but he added a couple of really cool spins to it that will benefit all VM's

And I will take credit for LMM

Thanks,

Bill

Cluso99 · 2011-05-25 22:56

Bill, sorry, I didn't mention the overlay loader to get credit. It is just another piece of the puzzle and indeed I took the reverse loop from Phil? I needed the overlay routines so that I could separate the parts of the spin interpreter while I was speeding it up. I used your LMM concept to create my zero footprint debugger that lives in $1F0-$1F3.

But you are so correct, lots have contributed ideas and it is so fantastic that Chip even bothers to discuss features with us, let alone taking feedback. Prop II will truly be an awesome chip.

Heater. · 2011-05-29 12:33

Attached is v2.0 of the ZPU emulator in C for exercising ZPU code on a PC.

The main change here is a switch to select running as a big or little endian processor. I will be using this to check that the output from Andrey's little endian zpugcc compiler runs as expected. So far it does.
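A switch like that usually comes down to how 32-bit words are assembled from bytes. The following C sketch shows the idea; `read_word` and `write_word` are hypothetical names and this is not the code from the attached emulator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit word from byte-addressed memory in the selected byte order. */
static uint32_t read_word(const uint8_t *mem, size_t addr, bool little_endian) {
    assert(mem != NULL);
    const uint8_t *p = mem + addr;
    if (little_endian)
        return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
               (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
    return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
           (uint32_t)p[2] << 8 | (uint32_t)p[3];
}

/* Write a 32-bit word in the selected byte order. */
static void write_word(uint8_t *mem, size_t addr, uint32_t v, bool little_endian) {
    assert(mem != NULL);
    for (int i = 0; i < 4; i++) {
        /* Little endian puts the low byte first; big endian puts it last. */
        int shift = little_endian ? 8 * i : 8 * (3 - i);
        mem[addr + (size_t)i] = (uint8_t)(v >> shift);
    }
}
```

Building words from bytes this way keeps the emulator independent of the byte order of the PC it runs on, so the same binary can exercise both big and little endian ZPU images.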

Run it as "zpu -h" to get a description of the command line parameters.

Andrey, When I try to compile any code that uses printf or iprintf with your patched compiler I get an error that _stack is undefined.

Anyway if little endian ZPU code runs through that emulator I can then convert Zog itself to little endian and have known working test programs.

Heater. · 2011-05-29 13:54

Andrey,

I think there is an error in your little endian patch file. The line

+STACK_ADDR= 16711680

should not have a space after the equals.

With that space removed and rebuilding the compiler I can now compile a normal "hello world" program with no errors about undefined _stack. It then runs fine under the zpu emulator:)

Now to get Zog changed to little endian...

David Betz · 2011-05-29 14:19

What ZOG sources are you using for this experiment? Are you using the sources I checked into Google Code?

Heater. · 2011-05-29 14:26

None yet. I'm just trying to prove that Andrey's little endian compiler is working. The zpu emulator is pretty accurate as a test tool I think.

I will be coming to Zog itself soon. I thought I would tackle these endian mods starting from my last zog_v1_6 as I will probably need the single step debug feature.

David Betz · 2011-05-29 14:27

Heater. wrote: »

None yet. I'm just trying to prove that Andrey's little endian compiler is working. The zpu emulator is pretty accurate as a test tool I think.

I will be coming to Zog itself soon. I thought I would tackle these endian mods starting from my last zog_v1_6 as I will probably need the single step debug feature.

Actually, single stepping works with the zogload version as well although I admit that I haven't tried it lately so it might have suffered bit rot. One nice thing about single stepping in zogload is that it doesn't use any Spin code. I wrote a PASM debug stub that runs entirely in a COG.

Heater. · 2011-05-29 14:33

Ah OK, I just assumed the single stepping got lost along the way. Don't know why. I'll see what I can do with it.

David Betz · 2011-05-29 14:36

Heater. wrote: »

Ah OK, I just assumed the single stepping got lost along the way. Don't know why. I'll see what I can do with it.

Theoretically you can just give -d on the command line to enable debugging mode.

Heater. · 2011-06-03 12:03

Zog goes little endian.

I just hacked my old zog_v1_6 to be a little endian ZPU. I recompiled some of the test programs with Andrey's patched zpugcc and lo, they worked first time! That is: fibo, dhrystone, size, ackermann, mall and endian.

With the bonus that fibo is now almost 7% faster!

I have to test some more before I post the code.

Bill Henning · 2011-06-03 12:12

Sounds great!!!

I am curious to see how much the IM look-ahead will improve things... I am hoping for >10% improvement.
