Can we follow protocol, with you making a pull request on my repo?
Which repo would that be? I looked on your github page and couldn't find a repo for Zog, but maybe I overlooked it? Or is it somewhere other than github?
Hmm - maybe Heater can have a RISC V surrounded by ZPU minions in a future FPGA. I assume that the ZPU is smaller than a RISC V, but I haven't verified that. I hacked on a Propeller COG core out of curiosity and it seemed larger than a RISC V, at least in the Lattice iCE40 parts. (The free tools don't like the multidimensional wires which Chip used as muxes.)
It looks like it might almost be a drop in replacement for the picorv32. I had half a mind to do that and see how big it came out when compiled under Quartus, even if I never got it running.
Wikipedia says this: "The base ZPU has a 32-bit data path. The ZPU also has a variant with a 16-bit-wide data path, to save even more logic."
Do you know if ZPU-GCC also supports the 16-bit-wide ZPUs (16-bit int and 32-bit long)?
I have not found a commandline switch for that.
For small FPGAs and simple control tasks, a 16bit version of the ZPU would be very welcome.
The picorv32 only takes 7% of a DE0 Nano and fits in those cheap little Lattice ICE 40 devices with plenty of room spare. One can even make a version that uses 16 registers instead of 32 for extra shrinkage.
I don't see any options for anything other than a 32 bit version of GCC for the ZPU. Perhaps the data path is 16 bits wide but all instructions still operate on 32 bits? I'm not sure, I haven't seen any references to the smaller ZPU (other than Wikipedia).
I couldn't find any reference either. I don't think that reducing the number of registers will reduce the FPGA area, given that they are implemented in RAMs. This paper did come up about a bit-serial approach for a MIPS ALU:
Thank you for checking.
As long as I only use short data types the 32bit GCC may work well also for 16bit. But sometimes you need 32bit calculations. I tried long long and ZPU-GCC produces a lot of code to work with two ram locations per variable. This will then just tie two 16bit values together on a 16bit implementation. Data arrays however must be declared as int, not long long.
All in all a bit strange...
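To make the "two RAM locations per variable" point concrete, here is a minimal sketch of how a wider add can be built from two narrower words with a carry. It is only an illustration: the `pair16` type and the function names are made up for this example, and it is not what ZPU-GCC actually emits.

```c
#include <assert.h>
#include <stdint.h>

/* A 32-bit value held as two 16-bit words (hypothetical layout). */
typedef struct {
    uint16_t lo;
    uint16_t hi;
} pair16;

pair16 pair16_from_u32(uint32_t v)
{
    pair16 p = { (uint16_t)(v & 0xFFFFu), (uint16_t)(v >> 16) };
    return p;
}

uint32_t pair16_to_u32(pair16 p)
{
    return ((uint32_t)p.hi << 16) | p.lo;
}

/* 32-bit add using only 16-bit results plus a carry between the halves. */
pair16 add32_from_16(pair16 a, pair16 b)
{
    pair16 r;
    r.lo = (uint16_t)(a.lo + b.lo);
    /* If the low sum wrapped around, it is smaller than either operand. */
    uint16_t carry = (uint16_t)(r.lo < a.lo);
    r.hi = (uint16_t)(a.hi + b.hi + carry);
    return r;
}

/* Small self-check a caller can run once at start-up. */
void add32_selfcheck(void)
{
    pair16 r = add32_from_16(pair16_from_u32(0x0001FFFFu), pair16_from_u32(1u));
    assert(pair16_to_u32(r) == 0x00020000u);
}
```

The same pattern is what a compiler has to generate for every operation on a type wider than the machine word, which is why the code grows so quickly.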
Heater
"How small an FPGA do you want to go?"
Lattice has some nice, cheap 1.2k-LUT FPGAs with Flash on chip and single supply voltage (MachXO2/3). There are even smaller ones, but they have not enough RAM. My observation is: for these little FPGAs and size optimized CPUs the LUT count nearly doubles for 32bit, compared to 16bit. Also the max frequency falls to 2/3 of the 16bit version.
These FPGAs are something I'm only just getting to grips with. The last time I used any programmable logic it was PALs back in the 1980s!
Clearly there are all kinds of CPU cores one could use, in all kinds of sizes. That is at least what I gather from opencores.org.
I'm attracted to the RISC V because I'm sure compiler support will be around for a long time.
I'm looking forward to getting some cheap little Lattice ICE 40 parts to play with. With the added bonus that there are now Open Source tools to configure them.
What about the speed of 32b counters?
I like the approach P2 has taken, which is effectively to choose a SysCLK that is the counter limit, and then run the core at SysCLK/2.
The focus on 1 clk per opcode was marketing driven, and is not best use of silicon.
16b is out of fashion, and you are right about code size, but one limit for 16b is counters should really be 32b not 16b, for sensible dynamic range.
I cannot understand why anyone makes a 32b MCU with 16b timers.
Some vendors do a 24b+8b timer, where 8b selects the prescaler, and that can save needing 2 registers to config, at some loss of precision.
NXP I think have 32b prescaler and 32b timer, which is very easy to understand, if a little more silicon than strictly anyone needs.
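The dynamic-range argument is easy to put numbers on. The sketch below assumes an 80 MHz timer clock purely as an example figure, and the 24b+8b packing is a hypothetical layout (a power-of-two prescaler selector in the top byte), not any particular vendor's register map.

```c
#include <assert.h>
#include <stdint.h>

/* Longest interval, in seconds, that a free-running counter of the given
   width can measure before it wraps, with no prescaler. */
double timer_wrap_seconds(unsigned bits, double clk_hz)
{
    assert(bits >= 1 && bits <= 63);
    assert(clk_hz > 0.0);
    return (double)(UINT64_C(1) << bits) / clk_hz;
}

/* Hypothetical 24b+8b register: top 8 bits select the prescaler,
   low 24 bits hold the count. One write configures both. */
uint32_t timer_pack_24_8(uint8_t prescale_sel, uint32_t count24)
{
    assert(count24 <= 0xFFFFFFu);
    return ((uint32_t)prescale_sel << 24) | count24;
}
```

At the assumed 80 MHz, `timer_wrap_seconds(16, 80e6)` is about 0.82 ms, while `timer_wrap_seconds(32, 80e6)` is about 53.7 s, which is the gap being complained about.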
I've managed to get ZOG working on P2. It's checked in to my ZOG repo (https://github.com/totalspectrum/zog). It needs fastspin version 3.6.2 to compile for P2:
It does not use any P2 specific features yet, so multiply and divide are still done in software and there's no XBYTE loop.
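For readers wondering what "done in software" costs, this is a generic sketch of shift-and-add multiply and shift-and-subtract divide in C. It is not the routine ZOG uses (that lives in the repo); the function names are invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shift-and-add multiply: returns the low 32 bits of a * b. */
uint32_t soft_mul32(uint32_t a, uint32_t b)
{
    uint32_t product = 0;
    while (b != 0) {
        if (b & 1u)
            product += a;      /* add the shifted multiplicand for each set bit */
        a <<= 1;
        b >>= 1;
    }
    return product;
}

/* Restoring unsigned divide, one quotient bit per iteration.
   Stores the remainder through rem when rem is not NULL. */
uint32_t soft_udiv32(uint32_t n, uint32_t d, uint32_t *rem)
{
    assert(d != 0);
    uint32_t q = 0, r = 0;
    for (int i = 31; i >= 0; i--) {
        uint32_t carry = r >> 31;              /* bit about to be shifted out */
        r = (r << 1) | ((n >> i) & 1u);
        if (carry || r >= d) {                 /* partial remainder >= divisor */
            r -= d;
            q |= UINT32_C(1) << i;
        }
    }
    if (rem != NULL)
        *rem = r;
    return q;
}

/* Convenience wrapper returning only the remainder. */
uint32_t soft_umod32(uint32_t n, uint32_t d)
{
    uint32_t r;
    (void)soft_udiv32(n, d, &r);
    return r;
}
```

The divide always takes 32 loop iterations and the multiply up to 32, which is why a hardware multiplier or divider can make a visible difference in an interpreter.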
Cool.
Any benchmark figures (speeds, code sizes) for this initial version? E.g. ZOG.FPGA vs ZOG.P1 vs ZOG.P2?
The results.txt file in the github repo tracks the benchmarks, and you can examine it to see what effect various optimizations have had. The current P2 version of the ZPU interpreter is more than twice as fast as the P1 version (at the same clock speed), but that's not an apples to apples comparison because the P2 one has some generic optimizations that the P1 one is missing.
Only twice as fast? I thought the XBYTE instruction would make it much more than that along with the higher clock speed of the P2.
ZOG isn't using XBYTE yet, and the FPGA I'm using (DE2-115) is clocked at the same speed as P1. I've done some more optimizing on the P2 version and it's approaching 4x the P1 speed now, but again the P1 interpreter could be optimized more.
Yes, I just hadn't pushed my local changes to the repo.
Looks like it could do with a version number alongside each version's stats, and I see no code-size numbers?
How many cycles does a native ZPU require for these tests?
The code is there, feel free to extract whatever stats you want from it. I don't have a native ZPU (and haven't tried putting one into FPGA) so I don't know what the relative performance is. There are dhrystone benchmark figures, but I haven't tried to port dhrystone over yet.
Comments
Which repo would that be? I looked on your github page and couldn't find a repo for Zog, but maybe I overlooked it? Or is it somewhere other than github?
Thanks,
Eric
Looks like you are the owner of the Zog project now
There is a ZPU core in verilog in the ZPU repo: https://github.com/zylin/zpu/blob/master/zpu/hdl/avalanche/core/zpu_core.v
It looks like it might almost be a drop in replacement for the picorv32. I had half a mind to do that and see how big it came out when compiled under Quartus, even if I never got it running.
Wikipedia says this: "The base ZPU has a 32-bit data path. The ZPU also has a variant with a 16-bit-wide data path, to save even more logic."
Do you know if ZPU-GCC also supports the 16-bit-wide ZPUs (16-bit int and 32-bit long)?
I have not found a commandline switch for that.
For small FPGAs and simple control tasks, a 16bit version of the ZPU would be very welcome.
Andy
How small an FPGA do you want to go?
The picorv32 only takes 7% of a DE0 Nano and fits in those cheap little Lattice ICE 40 devices with plenty of room spare. One can even make a version that uses 16 registers instead of 32 for extra shrinkage.
I don't see any options for anything other than a 32 bit version of GCC for the ZPU. Perhaps the data path is 16 bits wide but all instructions still operate on 32 bits? I'm not sure, I haven't seen any references to the smaller ZPU (other than Wikipedia).
Eric
"Ultrasmall: A Tiny Soft Processor Architecture with Multi-Bit Serial Datapaths for FPGAs"
https://www.jstage.jst.go.jp/article/transinf/E98.D/12/E98.D_2015PAP0022/_pdf
Thank you for checking.
As long as I only use short data types the 32bit GCC may work well also for 16bit. But sometimes you need 32bit calculations. I tried long long and ZPU-GCC produces a lot of code to work with two ram locations per variable. This will then just tie two 16bit values together on a 16bit implementation. Data arrays however must be declared as int, not long long.
All in all a bit strange...
Heater
"How small an FPGA do you want to go?"
Lattice has some nice, cheap 1.2k-LUT FPGAs with Flash on chip and single supply voltage (MachXO2/3). There are even smaller ones, but they have not enough RAM. My observation is: for these little FPGAs and size optimized CPUs the LUT count nearly doubles for 32bit, compared to 16bit. Also the max frequency falls to 2/3 of the 16bit version.
Andy
Clearly there are all kinds of CPU cores one could use, in all kinds of sizes. That is at least what I gather from opencores.org.
I'm attracted to the RISC V because I'm sure compiler support will be around for a long time.
I'm looking forward to getting some cheap little Lattice ICE 40 parts to play with. With the added bonus that there are now Open Source tools to configure them.
I like the approach P2 has taken, which is effectively to choose a SysCLK that is the counter limit, and then run the core at SysCLK/2.
The focus on 1 clk per opcode was marketing driven, and is not best use of silicon.
16b is out of fashion, and you are right about code size, but one limit for 16b is counters should really be 32b not 16b, for sensible dynamic range.
I cannot understand why anyone makes a 32b MCU with 16b timers.
Some vendors do a 24b+8b timer, where 8b selects the prescaler, and that can save needing 2 registers to config, at some loss of precision.
NXP I think have 32b prescaler and 32b timer, which is very easy to understand, if a little more silicon than strictly anyone needs.
(The ZPU programs must be compiled with the little-endian zpugccle to work with my version of ZOG. The source for that is in https://github.com/totalspectrum/zpugccle).
Eric
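As a reminder of why the compiler's byte order matters to the interpreter, here is a small C sketch of reading the same four bytes as a little-endian and as a big-endian 32-bit word. The helper names are invented for the example; this is not code from ZOG or zpugccle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit word stored least-significant byte first. */
uint32_t load32_le(const uint8_t *p)
{
    assert(p != NULL);
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Read a 32-bit word stored most-significant byte first. */
uint32_t load32_be(const uint8_t *p)
{
    assert(p != NULL);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Convert a word between the two byte orders. */
uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}
```

An interpreter running on a little-endian host can fetch words directly if the program image is little-endian too; otherwise every multi-byte access needs a swap like `bswap32`.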
Any benchmark figures (speeds, code sizes) for this initial version? E.g. ZOG.FPGA vs ZOG.P1 vs ZOG.P2?
The results.txt file in the github repo tracks the benchmarks, and you can examine it to see what effect various optimizations have had. The current P2 version of the ZPU interpreter is more than twice as fast as the P1 version (at the same clock speed), but that's not an apples to apples comparison because the P2 one has some generic optimizations that the P1 one is missing.
ZOG isn't using XBYTE yet, and the FPGA I'm using (DE2-115) is clocked at the same speed as P1. I've done some more optimizing on the P2 version and it's approaching 4x the P1 speed now, but again the P1 interpreter could be optimized more.
OK - numbers are here
https://github.com/totalspectrum/zog/blob/master/results.txt
Smallest numbers in that file are
P2:
....
with rdfast plus miscellaneous P2 optimizations
xxtea: 338584 cycles
but in the post above I see
xxtea results P2:
zog_p2: 238576 cycles
Is that the same test, but more optimized?
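If the two xxtea figures do come from the same test, the ratio of the cycle counts gives the speedup directly, since both runs are at the same clock. A minimal sketch (function names are made up for the example):

```c
#include <assert.h>
#include <stdint.h>

/* Speedup of a new run over a baseline, both measured in cycles
   at the same clock frequency. */
double speedup_from_cycles(uint64_t baseline_cycles, uint64_t new_cycles)
{
    assert(new_cycles > 0);
    return (double)baseline_cycles / (double)new_cycles;
}

/* Wall-clock time for a cycle count at a given clock frequency. */
double cycles_to_seconds(uint64_t cycles, double clk_hz)
{
    assert(clk_hz > 0.0);
    return (double)cycles / clk_hz;
}
```

With the numbers quoted here, `speedup_from_cycles(338584, 238576)` comes out at roughly 1.42, so the newer figure would be about 42% faster than the smallest one in results.txt.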
Looks like it could do with a version number alongside each version's stats, and I see no code-size numbers?
How many cycles does a native ZPU require for these tests?
The code is there, feel free to extract whatever stats you want from it. I don't have a native ZPU (and haven't tried putting one into FPGA) so I don't know what the relative performance is. There are dhrystone benchmark figures, but I haven't tried to port dhrystone over yet.
Eric