Zog - A ZPU processor core for the Prop + GNU C, C++ and FORTRAN. Now replaces S


Comments

  • Heater. Posts: 19,170
    edited July 2011
    Strangely enough I was wondering what on earth IBM does nowadays, having disposed of all their commonly known businesses like PCs, laptops and hard drives. Even the barman in my local pub had the same question.

    Well in 2010 they had a turnover of 99.8 billion dollars! That makes them 55th on the list of highest-turnover companies. As compared to Microsoft at only 62 billion. Google does not even appear on the Wikipedia list http://en.wikipedia.org/wiki/List_of_companies_by_revenue

    As my colleague said "They must sell a **** load of stuff to do that". Quite what that stuff is I'm still not sure, but as a clue the electronic check-in terminals at Dublin airport are displaying the IBM logo in the corner of the screen.
  • John Abshier Posts: 1,056
    edited July 2011
    A big part of IBM is services.

    John Abshier
  • Tor Posts: 1,707
    edited July 2011
    Some large customers of the company I work for use a lot of IBM computers. They buy tons of them, from PC-sized boxes to huge water-cooled roomfuls of racks. The thing is that their applications could all have been running on Intel hardware w/Linux (and they do have some such systems too), but IBM gives them this: they will guarantee full hardware support for every system for 15 years minimum. And that is why (and the only reason why) they buy IBM.
  • David Betz Posts: 11,294
    edited July 2011
    Heater: Any update on when you might be able to send or upload the little-endian version of ZOG? I'd like to make it work with my Propeller external memory loader.

    Thanks,
    David
  • Dr_Acula Posts: 5,480
    edited March 2012
    Hi heater,

    I've just finished designing a memory board for a touchscreen and as an unintended consequence, have a fairly fast memory address circuit. It uses two 512k memory chips in parallel, and you load an address into 3 latches (19 bits), then read or write a 16 bit word. I think it is only 6 or 7 pasm instructions per word. (cf the dracblade which is 40 pasm instructions per word).

    What this means is you can move data to and from external ram almost as fast as moving it to and from hub.

    So I was thinking about languages that are designed from the ground up to bypass LMM and go straight to XMM with a flat 32 bit memory model.

    Maybe Zog already does this?

    If not, looking through the opcodes http://repo.or.cz/w/zpu.git?a=blob_plain;f=zpu/docs/zpu_arch.html;hb=HEAD#instructionset these are certainly very minimalist. All stack based. Perfect for Forth.

    What accesses external memory?

    STORE and LOAD certainly.
    And the core part of the emulator that is fetching the program.

    Is the stack held in external memory, or is there so much hub now free in such a model that you would put the stack in hub?
    Answers: 1) A quadcopter. 2) Very high. 3) The internet. 4) A lot. 5) No.
  • David Betz Posts: 11,294
    edited March 2012
    Dr_Acula wrote: »
    Hi heater,

    I've just finished designing a memory board for a touchscreen and as an unintended consequence, have a fairly fast memory address circuit. It uses two 512k memory chips in parallel, and you load an address into 3 latches (19 bits), then read or write a 16 bit word. I think it is only 6 or 7 pasm instructions per word. (cf the dracblade which is 40 pasm instructions per word).

    What this means is you can move data to and from external ram almost as fast as moving it to and from hub.

    So I was thinking about languages that are designed from the ground up to bypass LMM and go straight to XMM with a flat 32 bit memory model.

    Maybe Zog already does this?

    If not, looking through the opcodes http://repo.or.cz/w/zpu.git?a=blob_plain;f=zpu/docs/zpu_arch.html;hb=HEAD#instructionset these are certainly very minimalist. All stack based. Perfect for Forth.

    What accesses external memory?

    STORE and LOAD certainly.
    And the core part of the emulator that is fetching the program.

    Is the stack held in external memory, or is there so much hub now free in such a model that you would put the stack in hub?
    Why not just make a cache driver for either PropGCC or Catalina?
  • Dr_Acula Posts: 5,480
    edited March 2012
    Good question. Catalina already has an option to add a cache and it greatly speeds up the program.

    It is more a question of getting inside the code of these C programs. Fine if the code part of "printf" is separate from the display part of "printf", but in a touchscreen with external memory they end up entwined with each other. I haven't looked at GCC. Catalina can probably do this, but it is very complicated to understand to the point of being able to rewrite both the display driver and the memory management driver. I am sure it can be done for both programs but probably not something I would be able to do. You need to be able to pause program execution while the display cog takes over memory access.

    I was thinking Zog might be simpler to understand, given there are even fewer opcodes than pasm.

    More bluesky brainstorming really than anything concrete. I might crawl back under my rock for a while and think about this some more.
  • Heater. Posts: 19,170
    edited March 2012
    David,

    Are you still interested in the little-endian Zog? With all the propgcc progress it seems to be a bit overtaken.

    I'd still love to get that wrapped up but I'm seriously busy until the end of March or so and it would take a while for me to find where everything is and remind myself of where we got to.
  • David Betz Posts: 11,294
    edited March 2012
    Heater. wrote: »
    David,

    Are you still interested in the little-endian Zog? With all the propgcc progress it seems to be a bit overtaken.

    I'd still love to get that wrapped up but I'm seriously busy until the end of March or so and it would take a while for me to find where everything is and remind myself of where we got to.
    Yes, I'm still interested especially since I recently found that a program I'm working on doesn't fit in hub memory anymore. Compilers that target LMM like PropGCC and Catalina are great for generating relatively fast code but their code density isn't very good. The higher code density of ZOG would be nice in cases where you need to fit lots of code in a small space but don't care if it runs a little slower.

    (Edited for clarity)
  • Dr_Acula Posts: 5,480
    edited March 2012
    Hi heater et al.

    OBC posted a question today about using external ram for femto basic http://forums.parallax.com/showthread.php?138617-23K256-SRAM-Why-not-expand-Femto-to-use-this

    This has me thinking about a new language - one that can be interpreted on the propeller, compiled on the propeller, or compiled on a PC and downloaded to the propeller. And which can be as big as you like.

    To write a high level language, you need a low level language. So I've been brainstorming the idea of a language designed from the ground up to live purely in external ram.

    It is kind of similar to Zog, but sooner or later you have to consider stack vs register models.

    Zog is fascinating to study. I love the way you can store constants to the stack by pushing 4 bytes. That gets around something in pasm that really bugs me - the 9 bit limit on a 32 bit machine. So instead of having an ever increasing list of constants in pasm that eventually grows so big it consumes the entire cog, you can push four bytes in a row, which pasm can do because that is only 8 bits, not 9.
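
    (A small aside: the actual ZPU mechanism is the IM opcode, which carries 7 payload bits per byte, so a full 32-bit constant takes five consecutive IM bytes, as comes up later in the thread. A minimal C sketch of how the accumulation would work, with the payload values worked out by hand for one example constant:)
    #include <stdint.h>
    #include <stdio.h>
    
    /* Each ZPU "IM" opcode carries a 7-bit payload.  The first IM in a
       run sign-extends its payload; each following IM shifts the
       accumulator left 7 bits and ORs in the next payload.  A full
       32-bit constant therefore needs ceil(32/7) = 5 consecutive IMs. */
    static int32_t im_accum;
    static int in_im_run;
    
    static void im(uint8_t payload7)          /* low 7 bits of the opcode */
    {
        if (!in_im_run) {
            /* first IM of a run: sign-extend the 7-bit payload */
            im_accum = (payload7 & 0x40) ? (int32_t)payload7 - 0x80
                                         : (int32_t)payload7;
            in_im_run = 1;
        } else {
            im_accum = (int32_t)(((uint32_t)im_accum << 7) | payload7);
        }
    }
    
    int main(void)
    {
        /* build the constant 0x12345678 from five IM payloads */
        im(0x01); im(0x11); im(0x51); im(0x2C); im(0x78);
        printf("0x%08X\n", (uint32_t)im_accum);
        return 0;
    }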

    But is a stack based machine the very best model?

    Considering this question led me to this paper http://static.usenix.org/events/vee05/full_papers/p153-yunhe.pdf

    So much to consider in there.

    And then I got sidetracked with the very first thing you need to write in an emulator, which is a way of jumping to each opcode based on the value of the opcode. It is interesting to consider writing an emulator in C - this is the first thing you come to and C can't do it! To explain that somewhat controversial statement, I found a discussion that explores this very issue http://blog.mozilla.com/dmandelin/2008/06/03/squirrelfish/ and they mention the problem in C as well - "switch" is not an efficient way to code a conditional jump table.

    Pasm can do this but I'm not sure the technique ends up being used in high level compilers for the propeller. More a curiosity in things like the Z80 emulation. There is an intriguing discussion in that blog above saying that if you are writing an emulator and you are going to use what is called direct threading, why not simplify it? Why not simplify it down to one instruction!

    Those who hate "goto" will have a fit - direct threading is the ultimate computed-goto nightmare. But if you need it to run a decent emulation, maybe it could be useful in higher level languages as well? The original BASIC had things like "ON A GOTO 1000,2000,3000,4000" and this could be compiled down to one emulated instruction that will run far faster than a C switch if there are many jump options (e.g. 256 or more), as you don't have to search through every one before making the jump.
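
    To sketch that ON A GOTO idea in C (the handler names here are made up for illustration), the whole thing collapses to one indexed call through a table - no chain of comparisons no matter how many targets:
    #include <stdio.h>
    
    /* Hypothetical handlers standing in for the BASIC line targets. */
    static void line1000(void) { puts("at 1000"); }
    static void line2000(void) { puts("at 2000"); }
    static void line3000(void) { puts("at 3000"); }
    static void line4000(void) { puts("at 4000"); }
    
    /* "ON A GOTO 1000,2000,3000,4000" compiled down to a single
       indexed jump: one table lookup regardless of target count. */
    static void (*const on_goto[])(void) = { line1000, line2000,
                                             line3000, line4000 };
    
    int main(void)
    {
        int a = 3;                    /* ON ... GOTO is 1-based in BASIC */
        if (a >= 1 && a <= 4)
            on_goto[a - 1]();
        return 0;
    }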

    There are other instructions that one could think about if one were to code a new emulation. An example is
    a = b+c
    which one could code to one assembly instruction. The instruction has three operands, a destination and two sources, which takes a little getting used to.

    So with 2k of code space in a cog, I am wondering if it is possible to fit more than a stack based emulator and think about a hybrid stack and register based emulation? I believe such an emulation does give the best of both worlds in terms of speed and memory use.

    That .pdf paper above explores some of this in more detail and there are some intriguing instructions for a virtual machine that seem kind of easy to understand. An example is this bit of register based code, with the stack based equivalent in comments
    1. move r10, r1 //iload_1
    2. move r11, r2 //iload_2
    3. iadd r10, r10, r11 //iadd
    4. move r3, r10 //istore_3
    

    Not a language I have ever seen, but it seems to make sense. I wonder if this language or something like it exists somewhere?
  • Heater. Posts: 19,170
    edited March 2012
    I always thought that a C switch statement compiled down to a single jump through a lookup table if all the cases were consecutive values. I haven't looked at such compiler output for a long time. This is of course much quicker than testing all the cases until you find the right one.

    If that does not work out one can always build a table of pointers to opcode handler functions, index the table from the opcode value and then call the function in that table entry. That's nice and fast but introduces a little overhead of a call and return for each opcode function execution.

    In GCC we can get rid of that by using a GCC extension called "labels as values". This feature allows you to "goto someVariable" where "someVariable" is actually the address of a label somewhere in the current function. So all you have to do is define a table containing the labels of all your opcode handlers, look up a table entry from the opcode and goto that entry, "goto dispatchTable[opcode]". At the end of the opcode handler function just goto the head of the dispatch loop again.
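
    A minimal sketch of that labels-as-values dispatch, with three toy opcodes (this needs GCC or Clang, since computed goto is an extension, not standard C):
    #include <stdio.h>
    
    /* Toy bytecode interpreter using GCC's "labels as values".
       Opcodes: 0 = increment accumulator, 1 = double it, 2 = halt.
       Each handler ends by jumping straight to the next handler --
       no central switch, no call/return per opcode. */
    static int run(const unsigned char *pc)
    {
        static void *const dispatch[] = { &&op_inc, &&op_dbl, &&op_halt };
        int acc = 0;
    
        goto *dispatch[*pc++];        /* enter the dispatch chain */
    
    op_inc:  acc += 1;  goto *dispatch[*pc++];
    op_dbl:  acc *= 2;  goto *dispatch[*pc++];
    op_halt: return acc;
    }
    
    int main(void)
    {
        /* inc, inc, dbl, inc, halt  ->  ((0 + 1 + 1) * 2) + 1 = 5 */
        static const unsigned char prog[] = { 0, 0, 1, 0, 2 };
        printf("%d\n", run(prog));
        return 0;
    }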

    I made a FullDuplexSerial driver in propgcc that uses labels as values to create the equivalent of the coroutines for tx and rx that you see in Chip's original FDS. It fits and runs in a COG and works at 115200 baud! It's in the propgcc demos.
  • Dr_Acula Posts: 5,480
    edited March 2012
    Wow - that is all very cunning!

    So is Zog being replaced with GCC? :)
  • Heater. Posts: 19,170
    edited March 2012
    Dr_A,
    So is Zog being replaced with GCC?

    Hmm...well you can't really say that as ZOG runs ZPU byte codes that are compiled by GCC for the ZPU architecture (zpugcc) and the Prop now has a kernel to run LMM PASM compiled by GCC (propgcc). So it's GCC everywhere!!

    Of course propgcc compiled code will run much faster than the ZOG bytecode interpreter, at least when talking about operating from HUB memory. But potentially ZOG byte codes give smaller binaries so you can get more functionality into HUB space. As Dave Betz pointed out a few posts back.

    I suspect also that there are some external memory architectures where having ZPU byte codes might be quicker than fetching PASM instructions.

    But then again propgcc will always have full support from Parallax whereas Zog only has a handful of followers.
  • OK, I've done something a bit insane. As part of a project of my own which could use a small bytecode language, I decided to play with the ZPU instruction set. I've been meaning to see if I could experiment with JIT compiling, and ZPU seemed like it might be simple enough to work with. Indeed it was, and in no small part thanks to Heater's nice clean code I was able to produce a version of ZOG that compiles ZPU instructions to Propeller instructions dynamically. The idea is that every ZPU bytecode maps to exactly two Prop instructions, which may themselves be subroutine calls (for the more complicated instructions). The Prop instructions are kept in an internal cache; in fact there's an L1 cache (inside COG memory) and an L2 cache (in HUB memory) so that we don't have to recompile too often. The ZPU branch instructions call out to a routine that looks up the cache line, calculates the offset into the line, and jumps there (after compiling and/or loading the converted instructions as necessary).
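
    To make the branch-lookup scheme concrete, here's a rough C model of the translation cache; the sizes, names, and fake "instruction" encodings are purely illustrative, not taken from zog_jit.spin:
    #include <stdint.h>
    #include <stdio.h>
    
    #define LINE_BYTES 16                 /* ZPU bytes covered per cache line */
    #define NUM_LINES   8                 /* direct-mapped, for simplicity */
    
    typedef struct {
        uint32_t tag;                     /* ZPU address of line start */
        int      valid;
        uint32_t code[LINE_BYTES * 2];    /* 2 compiled words per ZPU byte */
    } CacheLine;
    
    static CacheLine cache[NUM_LINES];
    static int misses;
    
    /* stand-in for the real bytecode -> PASM translator */
    static void compile_line(uint32_t base, CacheLine *line)
    {
        for (int i = 0; i < LINE_BYTES; i++) {
            line->code[2 * i]     = 0xA0000000u + base + (uint32_t)i;
            line->code[2 * i + 1] = 0x5C000000u;
        }
        line->tag = base;
        line->valid = 1;
        misses++;
    }
    
    /* what a branch does: find (or build) the line containing the target
       pc, then jump to the pair of compiled words for that exact byte */
    static uint32_t *lookup(uint32_t pc)
    {
        uint32_t base = pc & ~(uint32_t)(LINE_BYTES - 1);
        CacheLine *line = &cache[(base / LINE_BYTES) % NUM_LINES];
        if (!line->valid || line->tag != base)
            compile_line(base, line);     /* cache miss: (re)translate */
        return &line->code[(pc - base) * 2];
    }
    
    int main(void)
    {
        lookup(0x40); lookup(0x44);       /* same line: one miss, one hit */
        lookup(0xC0);                     /* conflicts with the 0x40 line */
        printf("misses=%d\n", misses);
        return 0;
    }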

    I've attached zog_jit.spin. It's a drop-in replacement for the 1.6 zog.spin, except that it uses more HUB space (so ZPU memory size has to be reduced a bit; I found I could fit a 24KB ZPU image).

    Performance is a little better than zog.spin, but not as much as I had hoped. Results with fibo and xxtea:
    xxtea: zog=970128 cycles, jit=851840 cycles, spin=1044640 cycles
    fibo:  zog=220064 cycles, jit=150672 cycles, spin=137360 cycles
    
    (the fibo time is for fibo(8)).

    There's probably some room for improvement; for example, some simple optimizations like merging a push with a following operation, or optimizing branches that fit within a cache line so they don't have to call out to the external branch routine. OTOH we're getting quite tight on memory; zog_jit works in HUB memory exclusively, it can't deal with any kind of external memory.

    On the P2 we could use the LUT as cache, and probably get much better performance. It would be interesting to see how JIT compiling compares with the new P2 fast bytecode lookups.

    Eric
  • eric,

    Yes, that is insane.

    You are waking Zog up from his slumbers in the iceberg. Who knows what happens next...

    zogjim1.jpg

    http://www.marvunapp.com/Appendix3/zogjim.htm

    Or at least, above and beyond the call of duty. Totally awesome.

    Of course what we want to see is a P2 version of this. GCC on the P2 via the ZPU.

    I had been thinking about that earlier, but now my mind has moved on to RISC V.

    I notice I don't have a github repo for Zog. Perhaps I should create one and then your JIT Zog could be included as an option.

    Or vice versa of course.
  • Eric,

    Thanks for the comment about "Heater's nice clean code"

    Nice to know that I can sometimes produce something legible.

  • Heater. wrote: »
    Thanks for the comment about "Heater's nice clean code"

    Nice to know that I can sometimes produce something legible.

    More than legible, it was actually a pleasure to read. :) I'm afraid I messed it up in the process of porting (at least partially because I started from zog 1.0 before discovering that there was a zog 1.6 buried in the comments :)).

    I'll push my sources up to a github repo, but it's still quite in flux. I feel like there should be some performance improvements available, and I think the dominant cost right now is cache misses, due to the tiny internal cache. I may try overlaying the compilation code into the cache line space, which should allow a much bigger internal cache.

    Eric
  • Eric,

    Oh what, you waded through all those pages?

    Ah, sorry. By the time of the 1.6 version I had locked myself out of the old "heater" account when the forum changeover happened. So I could no longer update the first post.


  • I've created a git repo at https://github.com/totalspectrum/zog.git with my latest version of zog_jit (and Heater's zog 1.6). The new JIT has a larger 2 way internal cache; some of the compiler code is overlaid with the cache, which raises the cost of cache misses but gives us a lot more space for cache.

    Results on xxtea decoding:
    GCC LMM -Os:   21984 cycles
    fastspin:      36400 cycles
    GCC CMM -Os:  362544 cycles
    zog_jit:      442016 cycles
    zog 1.6:      970128 cycles
    OpenSpin:    1044640 cycles
    

    Results for fibo(8):
    GCC LMM -Os:   10992 cycles
    fastspin:      32128 cycles
    GCC CMM -Os:   58800 cycles
    zog_jit:      115680 cycles
    OpenSpin:     137360 cycles
    zog 1.6:      220064 cycles
    

    So it looks like JIT compilation can give respectable numbers, although still not up to CMM levels of performance. It'd be very interesting to see how this compares with P2 XBYTE.

    Eric
  • jmg Posts: 10,081
    ersmith wrote: »
    So it looks like JIT compilation can give respectable numbers, although still not up to CMM levels of performance. It'd be very interesting to see how this compares with P2 XBYTE.

    Sounds good - is it easy to add the code size for each of these to the tables?

  • jmg wrote: »
    ersmith wrote: »
    So it looks like JIT compilation can give respectable numbers, although still not up to CMM levels of performance. It'd be very interesting to see how this compares with P2 XBYTE.

    Sounds good - is it easy to add the code size for each of these to the tables?

    The fibo function is 36 bytes in ZOG, versus ~100 bytes for the LMM compilers (fastspin and gcc), 26 bytes for gcc CMM, and 25 bytes for Spin.

    The btea function (main routine of xxtea) is 369 bytes in ZOG, versus 326 in Spin, 320 in gcc cmm, 696 in gcc lmm, and 904 in fastspin.

  • Another benchmark: Heater's own fftbench:
    gcc lmm:   142829 us
    fastspin:  171422 us
    gcc cmm:   508023 us
    zog JIT:   976817 us
    zog 1.6:  1376254 us
    openspin: 1465321 us
    

    The fft_bench code size:
    gcc lmm: 1156 bytes
    gcc cmm:  514 bytes
    zog:     1065 bytes
    

    Extracting the size of just one function is a pain in Spin, so I haven't tried to do it. Based on the overall binary sizes I'm sure it's smaller than CMM.
  • ersmith wrote: »
    Another benchmark: Heater's own fftbench:
    gcc lmm:   142829 us
    fastspin:  171422 us
    gcc cmm:   508023 us
    zog JIT:   976817 us
    zog 1.6:  1376254 us
    openspin: 1465321 us
    

    The fft_bench code size:
    gcc lmm: 1156 bytes
    gcc cmm:  514 bytes
    zog:     1065 bytes
    

    Extracting the size of just one function is a pain in Spin, so I haven't tried to do it. Based on the overall binary sizes I'm sure it's smaller than CMM.
    It's interesting that the ZPU code isn't much smaller than the LMM code.

  • David Betz wrote: »
    It's interesting that the ZPU code isn't much smaller than the LMM code.

    Yes, it is interesting, and I've also observed it in a few other programs I've tried (like my LISP interpreter). I think it's at least partly due to the mismatch between the ZPU's stack architecture and gcc's register architecture; there ends up being a lot of shuffling of data around the stack.

  • ersmith wrote: »
    David Betz wrote: »
    It's interesting that the ZPU code isn't much smaller than the LMM code.

    Yes, it is interesting, and I've also observed it in a few other programs I've tried (like my LISP interpreter). I think it's at least partly due to the mismatch between the ZPU's stack architecture and gcc's register architecture; there ends up being a lot of shuffling of data around the stack.
    I'm not sure why the mismatch between the ZPU and gcc's register architecture should make the ZPU code itself bigger. I think part of it might be the somewhat odd way they encode large immediate values.

  • David Betz wrote: »
    ersmith wrote: »
    Yes, it is interesting, and I've also observed it in a few other programs I've tried (like my LISP interpreter). I think it's at least partly due to the mismatch between the ZPU's stack architecture and gcc's register architecture; there ends up being a lot of shuffling of data around the stack.
    I'm not sure why the mismatch between the ZPU and gcc's register architecture should make the ZPU code itself bigger. I think part of it might be the somewhat odd way they encode large immediate values.

    The ZPU code is generated by gcc, and I think it does a suboptimal job because of the stack architecture (it's generating some redundant stack moves, I suspect). I don't think the odd immediate encoding is a problem -- pushing a 32 bit value takes only 5 bytes, and doing a 16 bit value is only 3 bytes, which is what it would be on any other stack machine too once you add in the opcode byte. The GNU linker does try to optimize the push sequences so that the general-case 5-byte sequence can be reduced to the shortest one that will work. Perhaps it isn't doing a very good job of that.

  • ersmith wrote: »
    David Betz wrote: »
    ersmith wrote: »
    Yes, it is interesting, and I've also observed it in a few other programs I've tried (like my LISP interpreter). I think it's at least partly due to the mismatch between the ZPU's stack architecture and gcc's register architecture; there ends up being a lot of shuffling of data around the stack.
    I'm not sure why the mismatch between the ZPU and gcc's register architecture should make the ZPU code itself bigger. I think part of it might be the somewhat odd way they encode large immediate values.

    The ZPU code is generated by gcc, and I think it does a suboptimal job because of the stack architecture (it's generating some redundant stack moves, I suspect). I don't think the odd immediate encoding is a problem -- pushing a 32 bit value takes only 5 bytes, and doing a 16 bit value is only 3 bytes, which is what it would be on any other stack machine too once you add in the opcode byte. The GNU linker does try to optimize the push sequences so that the general-case 5-byte sequence can be reduced to the shortest one that will work. Perhaps it isn't doing a very good job of that.
    Okay, I understand what you mean now. Yes, that is definitely a problem. We might have a similar problem if we try to target Chip's Spin2 VM with GCC.
  • If we want to get ZOG working with the P2 XBYTE interpreter mechanism we'll have to do away with the address modifications to adapt for byte swaps, i.e. we'll need a little endian tool chain. That turned out to be pretty straightforward, and I've put my version of the ZPU GCC toolchain adapted for little endian support up on https://github.com/totalspectrum/zpugccle. There are no binary versions for now, just the source, but it's easy to build under Linux (there's a build.sh script to do it).

    I've also modified my copy of ZOG (https://github.com/totalspectrum/zog) to use the little endian tools. This resulted in a pretty nice (>10%) performance bump for the interpreted ZOG, and a much smaller performance bump for the JIT ZOG.

    Eric
  • Eric,

    That is beyond cool.

    Now, what do I need to change in my old Zog to use your LE tool chain.....

    Do you have that in a repo that I could pull from?
  • Heater. wrote: »
    Eric,

    That is beyond cool.

    Now, what do I need to change in my old Zog to use your LE tool chain.....

    Do you have that in a repo that I could pull from?

    The LE toolchain is github.com/totalspectrum/zpugccle and my version of ZOG (which has both the JIT and your original ZOG interpreter, modified to work with little endian) is github.com/totalspectrum/zog. Make sure you get the "master" branch of zog, for some reason it keeps showing me a different branch on github. All I had to change in your interpreter was to comment out the lines that said "XOR here is an endianness fix", and then in the test Makefiles to comment out the lines that had objcopy --reverse-bytes=4.

    Eric