Zog - A ZPU processor core for the Prop + GNU C, C++ and FORTRAN.Now replaces S

John Abshier · 2010-09-15 07:56

This is a long thread. Perhaps I have missed something. Can one in ZOG set dira, outa, and read ina?

John Abshier

Heater. · 2010-09-15 08:34

No you have not missed anything John.

In Zog v1.6 I implemented a new memory map for Zog which maps HUB and COG memory into the Zog memory space.

So normal ZPU RAM, where your C code, data, and stack live, spans 256MB of address space; this is then followed by a 32K window mapped to HUB RAM, and that is followed by a 2K window mapped into the COG memory of the COG running the interpreter.

One can access DIRA, OUTA and all the Prop special function registers in that last area. One can also read/write the interpreter LONGs as well which is probably not a good idea:)

So the memory map is:

Normal RAM:
$00000000 to $0FFFFFFF

HUB RAM:
$10000000 to $10007FFF

COG RAM:
$10008000 to $100087FF

This map applies both when executing from HUB or external memory.

So to access DIRA:

DIRA is at $1F6 in the COG.
But that is an offset in LONGs. As a byte address it is 4 * $1F6 = $7D8.
In the ZPU memory map it is $10008000 + $7D8 = $100087D8

In C we need a pointer to DIRA to access it:

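// The address of the COG's DIRA register in the Zog memory map
#define DIRA_ADDRESS 0x100087D8

// Pointer to DIRA register (volatile so the compiler does not optimize the access away)
volatile unsigned int* dira_ptr = (volatile unsigned int*)DIRA_ADDRESS;

// Set pin P0 to output
*dira_ptr = 0x00000001;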

WARNING: I have not tested this yet.
OTHER WARNING: I've just realized that I have not allowed for reading the Prop's ROM so this may change a bit.
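The same arithmetic works for OUTA and INA, which is what John asked about. Here is a minimal sketch of it, assuming the v1.6 window addresses listed above stay as they are. The macro and function names are made up for illustration and are not part of Zog; the register indices ($1F2, $1F4, $1F6) are the standard Propeller ones.

```c
#include <assert.h>
#include <stdint.h>

/* Window base addresses taken from the memory map above. */
#define ZOG_HUB_BASE 0x10000000u  /* start of the 32K HUB RAM window */
#define ZOG_COG_BASE 0x10008000u  /* start of the 2K COG RAM window  */

/* Propeller special function register indices, counted in LONGs. */
#define COG_INA  0x1F2u
#define COG_OUTA 0x1F4u
#define COG_DIRA 0x1F6u

/* Convert a COG LONG index (0..511) to a byte address in the Zog map.
   Hypothetical helper, not something Zog provides. */
static uint32_t cog_reg_addr(uint32_t long_index)
{
    assert(long_index < 512u);              /* a COG holds 512 LONGs */
    return ZOG_COG_BASE + 4u * long_index;  /* LONG index -> byte offset */
}
```

On the target one would then cast the result to a pointer, for example `*(volatile uint32_t *)cog_reg_addr(COG_OUTA) = 1;`. Like the DIRA example, this is untested on real hardware.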

jazzed · 2010-09-15 08:39

David Betz wrote: »

Thanks but it was actually Bill who fixed the bug.

I didn't say you fixed it

It was fixed because of your interest.

Heater, I'll look at adding a count to dhrystone 2.1 later - I'm still on personal business so my time is still limited. When I ran the test before, it took much less than 40 seconds.

David Betz · 2010-09-15 14:04

Have you compared the performance of ZOG with the performance of ZiCOG? I know one is a 32 bit CPU and the other an 8 bit CPU but for small programs that will fit in 64k it might be possible to compare them. Is the ZOG instruction set faster than the Z80 instruction set emulated on a COG?

Heater. · 2010-09-15 14:36

David,

Have you compared the performance of ZOG with the performance of ZiCOG?

That's an interesting comparison which hasn't occurred to me. The only way to know for sure is to try some experiments.

I have a feeling that comparing instruction cycle times Zog is faster. I haven't checked but I think Zog has less PASM to churn through per opcode than ZiCog.

The Z80 C compilers treat ints as 16 bits which gives more work for the processor to do. But then it's possible that doing everything in short ints with zpu-gcc also generates more code than 32 bit ints.

On that subject, I have just posted the dhrystone 1.1 benchmark results for Zog to the compiler benchmarks thread. Attached is the dhrystone 1.1 source that I modified to get the timings under Zog.

lonesock · 2010-09-15 14:47

Can you post the binary? (preferably Hub RAM and no SD card)

thanks,
Jonathan

Heater. · 2010-09-15 15:10

You mean the binary loaded to the Prop or just the dhrystone zpu binary file?

Never mind I have attached both.

Note: They are both dependent on clock frequency and built for my 104MHz TriBlade.

lonesock · 2010-09-15 15:28

Awesome, thanks! I will use the binary to be included in the zog_debug, so I can change the crystal speed and at least see the serial output, even though the timing info will be off.

Btw, I am primarily working on the Float32 stuff to get Zog ready for a whetstone benchmark. Can I help integrating it?

Jonathan

Heater. · 2010-09-15 16:41

Lonesock,

I'm still pondering the best way to integrate floating point into Zog.

1) Do what I described for making C++ objects out of Spin objects. In which case run_zog and Zog itself need no changes. If floating point is wanted in a C program just link it in with everything else.

2) Use the ZPU SYSCALL instruction to pass floating point requests into the Zog interpreter and have the interpreter deal with the float mail box interface.

3) Have whatever starts up Zog also start up the float cog and put the mail box at some known address in high HUB memory that C programs can then interface with.

Just now I'm leaning toward 3). It has the advantage that the float cog's PASM need not be occupying space in HUB after it is started. And again zog itself needs no changes.

For example when running from HUB float32's PASM binary blob can be used from a "file" statement that actually includes it into what will become ZPU memory space when running.

Or if using the run_zog loader both the zog and float32 code space gets overwritten when zog is started and takes over all of HUB memory.

One thing is that it is required that the command mailbox used in float32 be "owned" by the client code and passed to float32 as a parameter to start() and preferably then through PAR to the PASM. This way the client code gets to decide where the mailbox will be in memory.

lonesock · 2010-09-15 17:04

Is there any way to have the compiler flag whether _any_ float (or double) variables are defined? If so start the FP cog, otherwise not. Either way the FP code can be overwritten in the Hub RAM once this step is done.

Right now, if you declare 5 longs in a row (anywhere in Hub RAM), it will work for you. Here is the layout:

long 0: FP_command => set to 0 before the FP cognew
long 1: FP_pointer => set to @FP_result before the FP cognew
long 2: FP_result => here's where you get your return values
long 3: FP_op1 => operand 1
long 4: FP_op2 => operand 2

So, how to start: Set FP_command to 0, FP_pointer to @FP_result, cognew the Float32 PASM, with par set to @FP_command.

How to call: Set FP_op1 and FP_op2, then set the FP_command...spin until FP_command = 0, read FP_result
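From the C side, the layout and call sequence above could look roughly like this. It is only a sketch: the struct and function names are made up, the command codes are not given in this thread so any nonzero value here is a placeholder, and the Hub address of the result long is assumed to be supplied by whoever sets up the mailbox.

```c
#include <assert.h>
#include <stdint.h>

/* Five consecutive longs in Hub RAM, in the order listed above. */
struct fp_mailbox {
    volatile uint32_t command;  /* long 0: nonzero = request pending, 0 = idle/done */
    volatile uint32_t pointer;  /* long 1: Hub address of the result long */
    volatile uint32_t result;   /* long 2: return value */
    volatile uint32_t op1;      /* long 3: operand 1 */
    volatile uint32_t op2;      /* long 4: operand 2 */
};

/* Prepare the mailbox before the Float32 PASM is started with cognew. */
static void fp_init(struct fp_mailbox *mb, uint32_t result_hub_addr)
{
    mb->command = 0;
    mb->pointer = result_hub_addr;
}

/* Post a request: operands first, command last, since the command
   is what tells the FP cog to start working. */
static void fp_post(struct fp_mailbox *mb, uint32_t cmd, uint32_t a, uint32_t b)
{
    assert(cmd != 0);   /* 0 means "idle", so it cannot be a request */
    mb->op1 = a;
    mb->op2 = b;
    mb->command = cmd;
}

/* Spin until the FP cog clears the command, then fetch the result. */
static uint32_t fp_wait(struct fp_mailbox *mb)
{
    while (mb->command != 0) {
        /* busy-wait; the FP cog writes 0 here when it is done */
    }
    return mb->result;
}
```

Splitting the call into a post step and a wait step keeps the ordering explicit, and a blocking call is simply `fp_post` followed by `fp_wait`.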

Jonathan

Heater. · 2010-09-15 23:53

lonesock,

Is there any way to have the compiler flag whether _any_ float (or double) variables are defined?

There may be a better way but here is one:

When the application uses basic floating point ops (+-*/) or comparisons or int/float conversions etc the compiler inserts code from its soft-float library. However one can write those functions into one's app or create one's own library containing them. In that case the build process links those in instead. So one could put a check in all of those functions to detect the first ever floating point call and start the fp COG at that point.
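A rough sketch of that first-call check, just to show the shape of it. All names here are hypothetical: `start_fp_cog` stands in for whatever would really launch the Float32 cog, and `my_float_add` stands in for a replacement soft-float routine (in libgcc the single precision add is `__addsf3`).

```c
#include <assert.h>

static int fp_cog_started = 0;  /* set once the FP cog is running */
static int fp_start_calls = 0;  /* only here so the sketch can be checked */

/* Hypothetical stand-in for whatever really launches the Float32 cog. */
static void start_fp_cog(void)
{
    fp_start_calls++;
}

/* Start the FP cog on the first floating point call only. */
static void fp_ensure_started(void)
{
    if (!fp_cog_started) {
        start_fp_cog();
        fp_cog_started = 1;
    }
    assert(fp_start_calls == 1);  /* never started twice */
}

/* Hypothetical replacement for a soft-float add routine. */
static float my_float_add(float a, float b)
{
    fp_ensure_started();
    return a + b;  /* placeholder: a real version would pass a and b to the FP cog */
}
```

Every replaced routine would begin with the same `fp_ensure_started()` call, so the cost after the first call is one test of a flag.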

I'm inclined not to worry about automatically detecting float usage. Users will just have to know if their app requires float or not and take care of starting it whether it be done from Spin at start up or from a C library function. Besides it is possible one would actually want to use the gcc soft-float routines instead of a COG.

That "mailbox" interface looks quite OK.

Also the latest version is working nicely here.

David Betz · 2010-09-16 05:43

I've been thinking about ZOG and how it makes use of COGs and also how Propeller software gets written to make use of multiple COGs in general. It seems like we sometimes just use more COGs as a way to get around the fact that a single COG is so limited in memory. For instance, ZOG uses a separate COG to access off-chip memory either by using VMCOG or SDRamCache. Both could just as well be code linked into the ZOG VM itself except that there isn't enough memory in one COG for that large a program. To me this suggests that we're not using the Propeller architecture correctly. COGs as separate execution engines are better used to create intelligent peripherals or components of a system that can run asynchronously. Using them to simply extend the address space of a single COG seems wasteful. However, we only seem to do this when we're trying to make a COG or the Propeller in general behave as a "general purpose computer" rather than letting it be a microcontroller whose purpose is to control attached hardware. If we really think that there is a need for larger control programs to orchestrate the intelligent peripherals, maybe it would make sense for the Propeller III (the PropII is already a done deal it seems) to have a more "high level language friendly" core to act as a "main processor" and leave the COGs to do what they are best at doing, controlling I/O. How about a PropIII with 8 (or 16) COGs and a MIPS core to run the "main code"?

Heater. · 2010-09-16 06:27

David,

It seems like we sometimes just use more COGs as a way to get around the fact that a single COG is so limited in memory.

True, for example ZOG is about to get use of a floating point COG. Thanks to the amazing work done by Lonesock that is likely to be down from the two COGS that floatfull used before.

For instance, ZOG uses a separate COG to access off-chip memory either by using VMCOG or SDRamCache. Both could just as well be code linked into the ZOG VM itself except that there isn't enough memory in one COG for that large a program.

True. However it is quite possible to put direct external RAM byte by byte access for the TriBlade, RamBlade and DracBlade like platforms into Zog itself. This may or may not speed it up but would save a COG. I have not done this as it seemed neat that any platform supported by VMCOG would be able to run Zog unchanged. So currently that use of an extra COG is by way of an abstraction layer for convenience rather than a necessity.

To me this suggests that we're not using the Propeller architecture correctly.

Oh dear:)

COGs as separate execution engines are better used to create intelligent peripherals or components of a system that can run asynchronously. Using them to simply extend the address space of a single COG seems wasteful.

Yep. That is why PASM overlays and LMM can be a better approach. I gave up on a 4 COG Z80 emulator attempt when I started to feel sick about wasting COGs and realized overlays would be the same speed in the end.

However, we only seem to do this when we're trying to make a COG or the Propeller in general behave as a "general purpose computer" rather than letting it be a microcontroller whose purpose is to control attached hardware.

Question is, when does micro-controller application become "general purpose computer" application?

If we really think that there is a need for larger control programs to orchestrate the intelligent peripherals, maybe it would make sense for the Propeller III (the PropII is already a done deal it seems) to have a more "high level language friendly" core to act as a "main processor" and leave the COGs to do what they are best at doing, controlling I/O. How about a PropIII with 8 (or 16) COGs and a MIPS core to run the "main code"?

Ah yes, I've often thought that a marriage of an ARM and a Prop or two on a board would be great. ARM running Linux, handling networking, storage, USB etc. Prop doing the real-time control stuff. In fact I'm planning to make such a marriage with this module http://www.igep-platform.com/index.php?option=com_content&view=article&id=109&Itemid=123 It's about the size of a DIP Prop and would sit on a board with a nice big DIP Prop on each side:)

As for Prop III that should be more of the same Prop goodness, it should have 64 bit Cogs. That makes it possible to extend the instruction's src and dst fields to 25 bits, therefore allowing 32 Mega longs (2^25) of COG space:)

David Betz · 2010-09-16 06:48

Heater. wrote: »

I've often thought that a marriage of an ARM and a Prop or two on a board would be great. ARM running Linux, handling networking, storage, USB etc. Prop doing the real-time control stuff. In fact I'm planning to make such a marriage with this module http://www.igep-platform.com/index.php?option=com_content&view=article&id=109&Itemid=123 It's about the size of a DIP Prop and would sit on a board with a nice big DIP Prop on each side:)

Actually, Andre' LaMothe already has a couple of boards that combine a Propeller chip and another more traditional microcontroller. They are his Chameleon products. The problem is that the bandwidth between the PIC/AVR and the Propeller isn't that fast. It is done using a SPI connection. I guess I thought the communications between the COGs and the CPU could be made more efficient if both were on the same chip. Maybe allow the CPU to use DMA to directly access COG RAM and/or hub RAM.

jazzed · 2010-09-16 07:02

Funny MIPS was mentioned before. I was just looking at the Maxim USIP system on a chip ... very attractive especially given my MIPS background.

I'm working with a tiny AVR connected to a Propeller in my spare time this week that will use I2C to provide a keyboard and mouse interface - no cogs wasted

Of course a combined PASM EEPROM/Keyboard/Mouse COG driver also makes sense if speed is important.

On VMs, I had a similar result as Heater by merging all code dispatch into one COG - there was little difference in performance and the code is easier to maintain.

Heater. · 2010-09-16 07:55

Yep, bandwidth between the Prop and anything else is always an issue but if the general purpose CPU is only there for user interface stuff, diagnostics, logging etc or to communicate pertinent data over the net that does not matter so much.

The more I think about this the sweeter it gets. That little IGEP module and two Props can sit on a board half the size of my TriBlade. The ARM can be used to program the Props. After twisting BradC's arm (pun intended) a bit we will be able to run BST on the IGEP and have a self hosting board:)

lonesock · 2010-09-16 09:44

Heater. wrote: »

...So one could put a check in all of those functions to detect the first ever floating point call and start the fp COG at that point.
...
Also the latest version is working nicely here.

Sure, lazy loading would work fine, if you wanted to go that route. Glad it's working!

OK, some thoughts about a FP cog used with Zog:

* as FP will be computed in a different cog, you don't have to waste time in Zog while waiting for the FP result...you might as well be doing other stuff if possible. How does GCC handle this case with co-processors? Is there a "start FP mul", then a separate "get FP result", maybe with an "expected time to completion" to allow GCC to schedule other stuff in the space?

* If the user is running multiple Zog cogs, do you want more than one FP cog? Otherwise, we should probably set up a locking system. And if we do a separate FP start and get_result, this lock would be absolutely necessary to make sure cog A doesn't get cog B's result, and vice versa.

* lazy loading also implies you could even start more than one FP cog. If you are using only 1 Zog cog, but have a very heavy FP app, and assuming the compiler can handle the FP_start and FP_get_result with multiple pipes, you could run multiple FP cores for increased throughput.

I don't know if there is an elegant way to accomplish any of this, just been on my mind.

Jonathan

Heater. · 2010-09-16 11:35

Lonesock,

Those are some big thoughts there. Personally I'm not up for any of them, I'd rather Keep It Simple.

* As far as GCC is concerned it issues a floating point instruction and will not do anything else until it is done. Just the same as any other instruction. In the absence of floating point instructions it just inserts a call to a subroutine to handle it. Same but slower.

* I think one fp COG is enough. If anyone feels the need for two or more isn't it time they got a processor better suited to their task? If they really want to, a C++ wrapper class for fp will allow them to start more.

* Potentially more than one Zog could be running and both want to use fp. Firstly again I'd say isn't it time to look for a more suitable processor? However as Zog will be able to access Propeller locks a lock could be used in the C++ wrapper class if required.

* I'm inclined to forget the lazy loading idea.

Anyway, let's get something working first:) Remember this is a micro-controller after all and a little floating point is a luxury item.

David Betz · 2010-09-17 20:05

I'm trying to port ZOG to Andre' LaMothe's C3 board and I'm having a bit of trouble with updating sdspiFemto.spin. I have to add a little extra code to deal with Andre's scheme for selecting SPI devices and it is overflowing the COG RAM by seven longs. I'm wondering if ZOG needs the I2C code that's in sdspiFemto.spin or if I can comment it out to gain back some space. Does the I2C code get used for anything by ZOG?

Heater. · 2010-09-18 00:22

David,

No Zog does not use any I2C features of sdspiFemto.spin.

I haven't thought really hard about file systems and sd drivers yet. I only use that version because it came from Cluso and he has adapted it to share pins with external RAM on the TriBlade.

If you make a smaller version with no I2C that can still do that pin sharing I might want to take it into use as well.

David Betz · 2010-09-19 04:19

I got ZOG running on Andre' LaMothe's C3 board using its two SPI SRAMs and my direct-mapped cache code. Here are the results of running FIBO on it.

ZOG v1.6 (CACHE)
Starting SD driver...0000FFFF
Mounting SD...00000000
Booting fibo.bin
00000000

Reading image... 17055 Bytes Loaded.
Done

Clearing bss: .......................
Running Program!
fibo(00) = 000000 (00000ms)
fibo(01) = 000001 (00000ms)
fibo(02) = 000001 (00000ms)
fibo(03) = 000002 (00000ms)
fibo(04) = 000003 (00001ms)
fibo(05) = 000005 (00001ms)
fibo(06) = 000008 (00003ms)
fibo(07) = 000013 (00005ms)
fibo(08) = 000021 (00008ms)
fibo(09) = 000034 (00014ms)
fibo(10) = 000055 (00022ms)
fibo(11) = 000089 (00037ms)
fibo(12) = 000144 (00059ms)
fibo(13) = 000233 (00097ms)
fibo(14) = 000377 (00157ms)
fibo(15) = 000610 (00254ms)
fibo(16) = 000987 (00414ms)
fibo(17) = 001597 (00673ms)
fibo(18) = 002584 (01092ms)
fibo(19) = 004181 (01765ms)
fibo(20) = 006765 (02846ms)
fibo(21) = 010946 (04589ms)
fibo(22) = 017711 (07402ms)
fibo(23) = 028657 (11954ms)
fibo(24) = 046368 (19329ms)
fibo(25) = 075025 (31299ms)
fibo(26) = 121393 (09798ms)

#pc,opcode,sp,top_of_stack,next_on_stack
#----------

0X00034D1 0X00 0X0000FFB8 0X00003822
BREAKPOINT

David Betz · 2010-09-19 04:35

Okay, now that I have ZOG running on the C3, how do I go about doing file I/O? I'd like to try to get my Basic bytecode compiler running in the 64k of SPI SRAM. If I can do that, I may try to retarget it to ZOG bytecodes so I can run the compiled Basic under ZOG instead of my own VM. I only need simple block level file I/O. Something like the Unix-style open/creat/read/write/lseek suite of functions. Is that possible currently using ZOG?

Heater. · 2010-09-19 04:42

Amazing, that's running pretty much the same speed as my VMCog ext RAM on a TriBlade with 20 pages of working set.

Eg,

fibo(20) = 006765 (02846ms) - Yours,
fibo(20) = 006765 (02812ms) - Mine

Still, there is not much paging/swapping going on with this tiny fibo.

How many pins is this solution using?

David Betz · 2010-09-19 06:19

Heater. wrote: »

Amazing, that's running pretty much the same speed as my VMCog ext RAM on a TriBlade with 20 pages of working set.

Eg,

fibo(20) = 006765 (02846ms) - Yours,
fibo(20) = 006765 (02812ms) - Mine

Still, there is not much paging/swapping going on with this tiny fibo.

How many pins is this solution using?

The C3 uses five pins for its SPI interface. The three normal SPI pins, DI, DO, CLK and a pair of pins used for addressing, CLR and CLK. This allows it to address up to 7 SPI devices by using a counter and a demultiplexer. To set an address you first pulse CLR to clear the counter to zero, then pulse CLK as many times as you need to get to the device you want to select. The SPI SRAM chips have the first two addresses, 1 and 2. Zero is reserved for deselecting all devices. The SD card is at address 5 and the other addresses are used for a flash chip and an A/D converter chip as well as two that are brought out to pins on a header that can be used by external hardware.
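In code, the select sequence is just "clear, then count". The sketch below simulates the counter in a variable so the logic can be checked off the board. The function names are made up, and on real hardware the two pulse functions would toggle the CLR and CLK select pins instead of touching a variable.

```c
#include <assert.h>

/* Simulated state of the C3's select counter, standing in for real pin toggles. */
static unsigned select_counter = 0;

/* One pulse on the select CLR pin: counter goes back to zero. */
static void pulse_sel_clr(void)
{
    select_counter = 0;
}

/* One pulse on the select CLK pin: counter advances to the next device. */
static void pulse_sel_clk(void)
{
    select_counter++;
}

/* Select SPI device `address`: 0 deselects everything, 1..7 pick a device. */
static void c3_select(unsigned address)
{
    unsigned i;
    assert(address <= 7u);
    pulse_sel_clr();
    for (i = 0; i < address; i++)
        pulse_sel_clk();
}
```

With the addresses given above, `c3_select(1)` or `c3_select(2)` would pick one of the SPI SRAMs, `c3_select(5)` the SD card, and `c3_select(0)` deselects all devices.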

Heater. · 2010-09-19 06:41

That is neat. Love to see how it performs with bigger code/data. How about you try dhrystone?

David Betz · 2010-09-19 10:35

Heater. wrote: »

That is neat. Love to see how it performs with bigger code/data. How about you try dhrystone?

Where do I find the dhrystone binary file?

Heater. · 2010-09-19 12:17

Oops, sorry, forgot I have not posted my dhrystone with working timer. That is a problem now as I'm away from all computers for two days and only have this phone to post with.

David Betz · 2010-09-19 14:47

Heater. wrote: »

Oops, sorry, forgot I have not posted my dhrystone with working timer. That is a problem now as I'm away from all computers for two days and only have this phone to post with.

No problem. Let me know when you get a chance to upload it after you get back.

Thanks,
David

Heater. · 2010-09-23 16:06

David Betz:

...how do I go about doing file I/O?

Somehow I missed that question.

There are two ways to go here:

1) Create a C wrapper for some SD card block driver object's PASM code and use a FAT file system written in C. There are sources for two FAT systems in the Zog directories already.

2) Make use of the ZPU SYSCALL instruction which can get us out of the emulator back into Spin and implement open, close, read, write etc as Spin syscall handlers.

Eventually I'd like to do 1) but I don't think it will fit into 64K.

The SYSCALL mechanism is already in place for stdio to the console and can easily be extended for file I/O so I think that is what will happen next. After I have the floating point implemented.

It would be great to have a BASIC that compiled to ZPU byte codes. There might be some ZPU in FPGA users that would be interested in it as well.

P.S. The first attempt at float support for Zog is now starting to work using Lonesock's new improved Float32.

jazzed · 2010-09-23 19:01

Heater, I've not had any time to look at integrating dhrystone with a timer. Did you build something?

Thanks.
