Zog - A ZPU processor core for the Prop + GNU C, C++ and FORTRAN. Now replaces S

Heater. · 2017-04-20 19:11

Can we follow protocol and have you make a pull request on my repo?

ersmith · 2017-04-20 19:17

Heater. wrote: »

Can we follow protocol and have you make a pull request on my repo?

Which repo would that be? I looked on your github page and couldn't find a repo for Zog, but maybe I overlooked it? Or is it somewhere other than github?

Thanks,
Eric

Heater. · 2017-04-20 19:32

Oh poop. You are right. There isn't one. I kept thinking to put it there but never did. My mind is slipping away....

Looks like you are the owner of the Zog project now

KeithE · 2017-04-20 23:20

Hmm - maybe heater can have a RISC V surrounded by ZPU minions in a future FPGA. I assume that the ZPU is smaller than a RISC V, but haven't verified. I hacked on a Propeller COG core out of curiosity and it seemed larger than RISC V at least in the Lattice iCE40 parts. (The free tools don't like the multidimensional wires which Chip used as muxes.)

Heater. · 2017-04-21 00:29

That is an interesting question.

There is a ZPU core in Verilog in the ZPU repo: https://github.com/zylin/zpu/blob/master/zpu/hdl/avalanche/core/zpu_core.v

It looks like it might almost be a drop-in replacement for the picorv32. I had half a mind to do that and see how big it came out when compiled under Quartus, even if I never got it running.

Ariba · 2017-04-22 10:39

Eric

Wikipedia says this: "The base ZPU has a 32-bit data path. The ZPU also has a variant with a 16-bit-wide data path, to save even more logic."

Do you know if the ZPU-GCC also supports the 16bit wide ZPUs (16bit int and 32bit long)?
I have not found a command-line switch for that.
For small FPGAs and simple control tasks, a 16bit version of the ZPU would be very welcome.

Andy

Heater. · 2017-04-22 11:07

Heard of it. Never seen it.

How small an FPGA do you want to go?

The picorv32 only takes 7% of a DE0 Nano and fits in those cheap little Lattice iCE40 devices with plenty of room to spare. One can even make a version that uses 16 registers instead of 32 for extra shrinkage.

ersmith · 2017-04-22 11:17

Ariba wrote: »

Wikipedia says this: "The base ZPU has a 32-bit data path. The ZPU also has a variant with a 16-bit-wide data path, to save even more logic."

Do you know if the ZPU-GCC also supports the 16bit wide ZPUs (16bit int and 32bit long)?
I have not found a command-line switch for that.
For small FPGAs and simple control tasks, a 16bit version of the ZPU would be very welcome.

I don't see any options for anything other than a 32 bit version of GCC for the ZPU. Perhaps the data path is 16 bits wide but all instructions still operate on 32 bits? I'm not sure, I haven't seen any references to the smaller ZPU (other than Wikipedia).

Eric

KeithE · 2017-04-22 16:10

I couldn't find any reference either. I don't think that reducing the number of registers will reduce the FPGA area, given that they are implemented in RAMs. This paper did come up, about a bit-serial approach for a MIPS ALU:

"Ultrasmall: A Tiny Soft Processor Architecture with Multi-Bit Serial Datapaths for FPGAs"
https://www.jstage.jst.go.jp/article/transinf/E98.D/12/E98.D_2015PAP0022/_pdf

Ariba · 2017-04-22 16:31

Eric

Thank you for checking.
As long as I only use short data types, the 32bit GCC may also work well for 16bit. But sometimes you need 32bit calculations. I tried long long, and ZPU-GCC produces a lot of code to work with two RAM locations per variable. On a 16bit implementation this would then just tie two 16bit values together. Data arrays, however, must be declared as int, not long long.
All in all a bit strange...
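
To illustrate the idea of tying two 16bit values together: a 32bit add on a 16bit data path amounts to adding the low halves, then adding the high halves plus the carry. Below is a minimal Python sketch of that arithmetic only. The function name is made up for illustration, and this is not a claim about the code ZPU-GCC actually emits.

```python
MASK16 = 0xFFFF

def add32_via_16(a, b):
    """Add two 32-bit values using only 16-bit-wide pieces plus a carry."""
    lo = (a & MASK16) + (b & MASK16)           # low halves; may overflow into bit 16
    carry = lo >> 16                           # 0 or 1
    hi = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry
    return ((hi & MASK16) << 16) | (lo & MASK16)  # result wraps modulo 2**32

print(hex(add32_via_16(0x0001FFFF, 0x00000001)))  # 0x20000
```

The carry out of the low half is the extra work a 16bit data path has to do for every 32bit operation.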

Heater
"How small an FPGA do you want to go?"

Lattice has some nice, cheap 1.2k-LUT FPGAs with Flash on chip and single supply voltage (MachXO2/3). There are even smaller ones, but they don't have enough RAM. My observation is: for these little FPGAs and size-optimized CPUs, the LUT count nearly doubles for 32bit compared to 16bit. Also the max frequency falls to 2/3 of the 16bit version.

Andy

Heater. · 2017-04-22 17:06

These FPGAs are something I'm only just getting to grips with. The last time I used any programmable logic it was PALs back in the 1980s!

Clearly there are all kinds of CPU cores one could use, in all kinds of sizes. That is at least what I gather from opencores.org.

I'm attracted to the RISC V because I'm sure compiler support will be around for a long time.

I'm looking forward to getting some cheap little Lattice iCE40 parts to play with. With the added bonus that there are now Open Source tools to configure them.

jmg · 2017-04-23 00:32

Ariba wrote: »

Lattice has some nice, cheap 1.2k-LUT FPGAs with Flash on chip and single supply voltage (MachXO2/3). There are even smaller ones, but they don't have enough RAM. My observation is: for these little FPGAs and size-optimized CPUs, the LUT count nearly doubles for 32bit compared to 16bit. Also the max frequency falls to 2/3 of the 16bit version.

What about the speed of 32b counters?
I like the approach P2 has taken, which is effectively to choose a SysCLK that is the counter limit, and then run the core at SysCLK/2.
The focus on 1 clk per opcode was marketing driven, and is not the best use of silicon.

16b is out of fashion, and you are right about code size, but one limit for 16b is counters should really be 32b not 16b, for sensible dynamic range.
I cannot understand why anyone makes a 32b MCU with 16b timers.
Some vendors do a 24b+8b timer, where 8b selects the prescaler, and that can save needing 2 registers to config, at some loss of precision.
NXP I think have 32b prescaler and 32b timer, which is very easy to understand, if a little more silicon than strictly anyone needs.
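
For a rough sense of the dynamic range involved, here is a small Python sketch of how long a free-running counter takes to wrap. The 80 MHz clock is an assumed figure chosen only for illustration, and the function name is hypothetical.

```python
def overflow_seconds(bits, clock_hz, prescale=1):
    """Time until a free-running counter of the given width wraps around."""
    return (2 ** bits) * prescale / clock_hz

CLOCK_HZ = 80_000_000  # assumed clock, for illustration only

t16 = overflow_seconds(16, CLOCK_HZ)
t32 = overflow_seconds(32, CLOCK_HZ)
print(f"16-bit: {t16 * 1e3:.3f} ms, 32-bit: {t32:.1f} s")
# prints "16-bit: 0.819 ms, 32-bit: 53.7 s"
```

At that clock a 16b timer needs a prescaler or software extension for anything longer than a millisecond or so, which is the range problem described above.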

ersmith · 2017-04-23 20:53

I've managed to get ZOG working on P2. It's checked in to my ZOG repo (https://github.com/totalspectrum/zog). It needs fastspin version 3.6.2 to compile for P2:

  fastspin -2 debug_zog.spin
  loadp2 /dev/ttyUSB1 debug_zog.binary -t

It does not use any P2 specific features yet, so multiply and divide are still done in software and there's no XBYTE loop.

(The ZPU programs must be compiled with the little endian zpugccle to work with my version of ZOG. The source for that is at https://github.com/totalspectrum/zpugccle.)
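
As a reminder of what the endianness requirement means in practice, this short Python sketch shows how the same 32-bit word is laid out in memory under each byte order. It only illustrates byte order in general and says nothing about the ZOG binary format itself.

```python
import struct

value = 0x12345678
little = struct.pack("<I", value)  # layout a little-endian build uses
big = struct.pack(">I", value)     # layout a big-endian build uses
print(little.hex(), big.hex())     # 78563412 12345678
```

An interpreter that assumes one layout will misread every multi-byte constant in a binary built for the other, which is why the matching compiler matters.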

Eric

jmg · 2017-04-23 23:01

ersmith wrote: »
I've managed to get ZOG working on P2. It's checked in to my ZOG repo (https://github.com/totalspectrum/zog). It needs fastspin version 3.6.2 to compile for P2:
  fastspin -2 debug_zog.spin
  loadp2 /dev/ttyUSB1 debug_zog.binary -t
It does not use any P2 specific features yet, so multiply and divide are still done in software and there's no XBYTE loop.

Cool.
Any benchmark figures (speeds, code sizes) for this initial version? E.g. ZOG.FPGA vs ZOG.P1 vs ZOG.P2?

ersmith · 2017-04-24 12:23

jmg wrote: »
Cool.
Any benchmark figures (speeds, code sizes) for this initial version? E.g. ZOG.FPGA vs ZOG.P1 vs ZOG.P2?

The results.txt file in the github repo tracks the benchmarks, and you can examine it to see what effect various optimizations have had. The current P2 version of the ZPU interpreter is more than twice as fast as the P1 version (at the same clock speed), but that's not an apples to apples comparison because the P2 one has some generic optimizations that the P1 one is missing.

David Betz · 2017-04-24 18:09

ersmith wrote: »
The results.txt file in the github repo tracks the benchmarks, and you can examine it to see what effect various optimizations have had. The current P2 version of the ZPU interpreter is more than twice as fast as the P1 version (at the same clock speed), but that's not an apples to apples comparison because the P2 one has some generic optimizations that the P1 one is missing.

Only twice as fast? I thought the XBYTE instruction would make it much more than that along with the higher clock speed of the P2.

ersmith · 2017-04-24 18:32

David Betz wrote: »
Only twice as fast? I thought the XBYTE instruction would make it much more than that along with the higher clock speed of the P2.

ZOG isn't using XBYTE yet, and the FPGA I'm using (DE2-115) is clocked at the same speed as P1. I've done some more optimizing on the P2 version and it's approaching 4x the P1 speed now, but again the P1 interpreter could be optimized more.

xxtea results

P1:
propgcc lmm:   21984 cycles
fastspin 3.6:  38064 cycles
propgcc cmm:  362544 cycles
riscvemu:     497824 cycles
zog JIT:      679168 cycles
zog 1.6LE:    885696 cycles
openspin:    1044640 cycles

P2:
fastspin 3.6:  18327 cycles
zog_p2:       238576 cycles
riscvemu:     243848 cycles
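
The ratios behind "approaching 4x" can be checked directly from the cycle counts above. A quick Python sketch follows, assuming that "zog 1.6LE" is the P1 interpreter being compared against (the post does not say which P1 entry is meant):

```python
# xxtea cycle counts copied from the results listed above
p1_cycles = {"zog JIT": 679168, "zog 1.6LE": 885696, "riscvemu": 497824}
p2_cycles = {"zog_p2": 238576, "riscvemu": 243848}

def speedup(slow, fast):
    """Ratio of cycle counts: how many times faster the second run is."""
    return slow / fast

zog_speedup = speedup(p1_cycles["zog 1.6LE"], p2_cycles["zog_p2"])
jit_speedup = speedup(p1_cycles["zog JIT"], p2_cycles["zog_p2"])
print(f"zog_p2 vs zog 1.6LE: {zog_speedup:.2f}x")  # 3.71x
print(f"zog_p2 vs zog JIT:   {jit_speedup:.2f}x")  # 2.85x
```

Both figures are per-cycle comparisons at the same clock speed, so they say nothing yet about what a faster-clocked P2 would add.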

jmg · 2017-04-24 20:32

ersmith wrote: »

The results.txt file in the github repo tracks the benchmarks, and you can examine it to see what effect various optimizations have had.

OK - numbers are here
https://github.com/totalspectrum/zog/blob/master/results.txt

Smallest numbers in that file are

P2:
....
with rdfast plus miscellaneous P2 optimizations
xxtea: 338584 cycles

but in above post I see

xxtea results P2:
zog_p2: 238576 cycles

Is that the same test, but more optimized?

Looks like it could do with a version number alongside each version's stats, and I see no code-size numbers?

How many cycles does a native ZPU require for these tests?

ersmith · 2017-04-24 21:17

jmg wrote: »

Is that the same test, but more optimized?

Yes, I just hadn't pushed my local changes to the repo.

Looks like it could do with a version number alongside each version's stats, and I see no code-size numbers?

How many cycles does a native ZPU require for these tests?

The code is there, feel free to extract whatever stats you want from it. I don't have a native ZPU (and haven't tried putting one into FPGA) so I don't know what the relative performance is. There are dhrystone benchmark figures, but I haven't tried to port dhrystone over yet.

Eric
