RISC V ?
Heater.
Q: What's the quickest way to get GCC working for the P2 ?
A: Drop a RISC V processor core in there. There are fully functional RISC V compilers from GCC and Clang/LLVM. RISC V cores can be pretty small and there is a bunch of them available already, free for use, in Verilog and VHDL.
Well, I'm joking. Mostly. I don't expect such a crazy thing to happen and I certainly would not want such a thing holding up P2 progress.
However, I just did something that made my eyes pop. A little while ago Chip posted the Verilog for the new P2 PRNG. It seemed short and sweet so I was inspired to install the Icarus Verilog simulator and see if I could learn just enough Verilog to check its output was correct. Turned out to be pretty easy. I had just installed Quartus, waiting for the new P2 with PRNG release, so of course I had to see if I could get the PRNG test running on a real FPGA. Soon I had randomly flashing LEDs on my DE0 Nano. I joked that I would now proceed to design my own CPU.
I have not designed my own CPU. I found a ready-made one, the picorv32 by Clifford Wolf: https://github.com/cliffordwolf/picorv32
The prospect of getting a CPU core working looked pretty daunting but after a while looking at what Clifford has there I realized it may not be impossible. I just cut and pasted his core into a Verilog project and started wrapping it around with memory and peripherals. I did not want all of Clifford's peripherals and buses and stuff. Too complicated for this humble beginner. Besides the challenge is to learn some Verilog so I wanted to make my own peripherals, as crude as they may be. Turns out that adding memory and GPIO to such a core is dead easy.
The end result is:
32 bit RISC V integer core with MUL and DIV running at 100MHz, about 25 MIPS.
12Kbytes RAM, using the memory on the Cyclone IV of the Nano.
GPIO port driving LEDs.
UART (Not quite done yet)
A PRNG port that serves up xoroshiro128+ random numbers.
It runs some "firmware" compiled with GCC for RISC V that just counts up on 8 LEDs. (The "Hello World" of embedded systems)
This all fits in 2600 Logic Elements, about 12% of the FPGA. Shrinks to 8% without MUL and DIV.
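For anyone curious what that PRNG port is computing, here is a minimal software sketch of xoroshiro128+ in Python. It assumes the rotate/shift constants 55, 14 and 36 from the originally published algorithm. I have not checked those against Chip's Verilog, so treat it as a model of the technique and not a bit-exact copy of the P2 PRNG. The names `rotl64` and `Xoroshiro128Plus` are made up for illustration.

```python
MASK64 = (1 << 64) - 1


def rotl64(x, k):
    """Rotate a 64-bit value left by k bits."""
    return ((x << k) | (x >> (64 - k))) & MASK64


class Xoroshiro128Plus:
    """Software model of xoroshiro128+ (constants 55, 14, 36 are an assumption)."""

    def __init__(self, s0, s1):
        self.s0 = s0 & MASK64
        self.s1 = s1 & MASK64
        if self.s0 == 0 and self.s1 == 0:
            # The all-zero state never changes, so refuse it.
            raise ValueError("state must not be all zero")

    def next(self):
        s0, s1 = self.s0, self.s1
        result = (s0 + s1) & MASK64          # the "+" in xoroshiro128+
        s1 ^= s0                             # xor
        self.s0 = rotl64(s0, 55) ^ s1 ^ ((s1 << 14) & MASK64)  # rotate, shift
        self.s1 = rotl64(s1, 36)             # rotate
        return result


if __name__ == "__main__":
    prng = Xoroshiro128Plus(1, 2)
    for _ in range(4):
        print(f"{prng.next():016x}")
```

A model like this is handy for generating expected values to compare against a simulator trace, provided the constants and seed are first matched to the hardware.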
Where does this all lead?
I have no idea. Just having fun. I could stick 8 of those cores on there and make a RISC V "poor man's Propeller". There is the SDRAM to take into use, and peripherals like the DE0 Nano's ADC and accelerometer. Or perhaps what about replacing a COG from the open source P1 Verilog with a picorv32 core? Think I have a lot to learn for that one.
Anyway, if anyone is tempted to get their feet wet with FPGA and Verilog I highly recommend it. I suggest getting hold of the Icarus Verilog compiler/simulator. It makes experimenting very quick and easy, rather than waiting for the slow and ponderous Quartus to build anything. Bit like hacking code in Python or JavaScript. Also it's easy to knock up quick test harnesses so you have some confidence your gadget will work. Without Icarus I would have given up in frustration ages ago.
http://iverilog.icarus.com/
Yeah, I know this is all off topic for a P2 forum. I was just so amazed at what is possible to do relatively easily nowadays I had to tell someone. Besides, it's Chip's fault for kicking me down this Verilog road. Thanks Chip. Oh and it does include Chip's P2 PRNG so it is in very small part a P2!
Rude and crude as it is the code is here:
https://github.com/ZiCog/xoro
I might get round to adding some documentation when I have simplified the way to build C code for it.
Comments
Might have been a mistake, we'll see...
I'd like to make a custom P1V one day maybe...
I'll have to remember this Verilog Simulator.
Sounds cool.
Did you try a build for the Lattice ICE40UP5K-SG48ITR50 ? (testable on ICE40UP5K-B-EVN)
This part has 128K Bytes SRAM, and 5280 LE, but I'm unclear on how Lattice LE map to Altera LE....
Did you run the above on Icarus, and what speed does Icarus simulate at?
Can Icarus read a ROM file, or does it need to recompile the Verilog for every simulation?
I should have done this years ago. Hmm...actually I did, I tried some experiments in VHDL running under the GHDL simulator. But VHDL is complicated and verbose. And FPGA boards were not so cheap and readily available then.
Might take a while to get one's head around the fact that Verilog is not like a regular programming language. Potentially every statement you write can be happening at the same time. But if you are used to juggling parallel things on the Propeller it's not so shocking.
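To make the "everything happens at the same time" point concrete, here is a small Python sketch of how clocked registers behave. In Verilog, nonblocking assignments inside a clocked block all read the old register values and then update together. The `clock_edge` helper and the two-register design are hypothetical, just enough to show the difference from ordinary sequential code.

```python
def clock_edge(state, rules):
    """Evaluate every rule against the same snapshot, then commit all at once."""
    next_state = {name: rule(state) for name, rule in rules.items()}
    return {**state, **next_state}


# Hypothetical two-register design: on each clock, a takes b and b takes a.
rules = {
    "a": lambda s: s["b"],
    "b": lambda s: s["a"],
}


def sequential_swap_attempt(state):
    """The same two lines run one after another, like ordinary software."""
    s = dict(state)
    s["a"] = s["b"]
    s["b"] = s["a"]  # reads the value just written, so the old a is lost
    return s


if __name__ == "__main__":
    start = {"a": 1, "b": 2}
    print(clock_edge(start, rules))        # the registers swap
    print(sequential_swap_attempt(start))  # both end up holding b's old value
```

The first call swaps the two values because both rules see the same starting state. The second does not, which is the habit from regular programming that takes a while to unlearn.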
But that is where things get interesting. Clifford Wolf runs that RISC V core on some Lattice FPGA. I forget which one but they are physically tiny and very cheap.
Not only that but Clifford and a few other guys have reverse engineered the Lattice bit streams you need to configure those devices and created synthesis tools. With that one can get an FPGA working with a totally open source tool chain. Those guys are serious turbo nerds!
So yeah, a Lattice FPGA dev board is now on my want list...
Yep, it all runs under Icarus and you can watch the RISC V core execute instructions. Trace the memory accesses etc. I guess it's dead slow. Good enough to dump a few hundred or thousand RISC V instruction steps per second. Good enough to see something actually works or not.
What I have been doing mostly is using Icarus to develop/test the components. E.g. create a UART, create a test bench for it, play with it till it works, then integrate it into the project. Icarus may be slow but the edit/test cycle is fast. As I said, like hacking code in Python. Icarus compiles code into some kind of byte code. For example, it compiles the UART and its test bench into a uart_tb.vvp file, which can then be run. This is all very quick for a simple module test.
If you feel the need for speed, or have a huge design that is slow to simulate, then there is Verilator. That compiles Verilog into C++, which you can then compile and run. I have not managed to get that working yet.
I should add this kind of stuff to docs in the repo.
Be interested if you do get that working, with speed stats on RISC V, as that seems a good way to get an exact P2 Simulator.
I believe the AVR simulator Atmel have, works this way - they feed it the chip design files, and get an EXE/DLL out.
Could be the iCE40, that 128k RAM family member is very new. (~$6) Eval boards in stock, but no disti-silicon yet.
Do you have any links to his Lattice work? The link above mentions only Xilinx (but does hit some impressive MHz numbers).
There are a ton of open source RISC-V implementations available now (e.g. the VectorBlox Orca which is made for FPGAs). RISC-V hardware is becoming available now from companies like SiFive. It'll be very interesting to see where it all ends up.
That is certainly an interesting way to get an accurate P2 simulator.
Yep, it's the iCE40 HX8K.
Be prepared to be amazed: fully open source FPGA synthesis tools, a demo with the picorv32 and 128K RAM on the iCE40 HX8K, even a board for the Raspberry Pi:
http://www.clifford.at/papers/2015/icestorm-flow/slides.pdf
http://www.clifford.at/icestorm/
Oh yeah, 8 picorv32 cores and some kind of HUB memory was on my mind too. Should just about fit in the nano. I didn't realize a RISC V core could be so small.
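A quick back-of-envelope check on the "should just about fit" claim, as a sketch. It uses the 2600 LE figure from the first post, which covers the whole little system (core plus memory glue, GPIO and PRNG), so it is a rough upper estimate per core. The 22,320 LE total is my assumption for the Cyclone IV on the DE0 Nano; it agrees with 2600 LEs being about 12%.

```python
LE_PER_SYSTEM = 2600   # from the first post: core with MUL/DIV plus peripherals
LE_TOTAL = 22320       # assumption: logic elements in the DE0 Nano's Cyclone IV


def utilisation(cores, le_per_core=LE_PER_SYSTEM, le_total=LE_TOTAL):
    """Fraction of the FPGA used by `cores` copies, ignoring any shared hub logic."""
    return cores * le_per_core / le_total


if __name__ == "__main__":
    for n in (1, 4, 8, 9):
        print(f"{n} cores: {n * LE_PER_SYSTEM} LEs, {utilisation(n):.1%} of the FPGA")
```

This ignores whatever the HUB memory arbiter would cost and assumes the design still routes when the device is nearly full, so the real answer could easily come out worse.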
As you say, RISC-V is extensible. With a few carefully crafted extensions to the instruction set and the smart pins it would make an excellent P3. I can't imagine Chip going for it though. The RISC V instruction set is designed to be compiler friendly not human assembler coder friendly. Consider this for example:
The Orca core looks interesting too. Should be an easy drop in replacement for the picorv32 here.
Anyway, how does this compare to a P2 COG? In terms of logic elements/chip area and speed?
I found these stats. I guess the packed LCs figure is the chip usage value. Interesting that Lattice packs better...
The new ICE40UP5K-SG48ITR50 has slightly less logic, 5280 LC (vs 7680 LC on the HX8K), but it has an easier QFN48 package and adds 128 kBytes of SRAM (vs 128 kbits), and based on those stats it should be roughly half full.
Space for a P1V COG ?
I'm pretty sure you can write it with li. The assembler will produce two instructions for li, a LUI and an ADDI, to compose a 32-bit constant, just like the Prop2 assembler produces an AUGS and a MOV for every ## in the equivalent Prop2 code.
I don't see a big difference.
Andy
Did you find the Verilog source for picorv32 on the iCE40 HX8K?
I can find a lot of tools links, but the source file for picorv32 for the Lattice tools is proving elusive.
I did find this, which is another FPGA board - looks nice.
https://mystorm.uk/
https://folknologylabs.wordpress.com/2016/08/03/storm-in-a-pint-pot/
and one comment here :
https://news.ycombinator.com/item?id=12193769
" cliffordvienna 245 days ago [-]
There is no 4K die. The 4K chips are using 8K dies, the lattice software limits the number of usable LUTs to 4K. IceStorm will give you access to all 8K LUTs in the device. "
Lattice may not like that information leaking out ..
I downloaded the latest iCEcube2 tools, and did a dummy run on an iCE40UP5K – SG48.
All seems ok, Synth -> P&R with green ticks everywhere.
This iCE40UP5K is quite a recent addition, here is the change log ...
iCEcube2 Version 2017.01
Added VPP_2V5_TO_1P8V synthesis feature for iCE40 Ultra and iCE40 UltraPlus devices.
Enhanced auto assign SPI dedicate pin if no SPI instance.
Fixed bug for RGB LED driver pin mapping issue in iCE40 UltraPlus device.
Fixed bug for I3C simulation model for iCE40 UltraPlus device.
iCEcube2 Version 2016-12
Removed the license control for iCE40UP5K – UWG30, iCE40UP3K – UWG30 and iCE40UP5K – SG48
iCEcube2 Version 2016.08
This version of the iCEcube2 software adds support for the device-package combinations:
iCE40UP5K-UWG30
iCE40UP3K-UWG30
iCE40UP5K-SG48
Enhanced Pin Constraint Editor with pullup/weakpullup constraints process.
Added support for Windows 10 OS, 32-bit and 64-bit.
Added Pack Area option to placer tool options.
Added VPP_2V5_TO_1P8V synthesis feature for ICE40 UltraLite devices
You are right. The assembler does deal with things like "li a5,0xffff0006" by generating two instructions.
My gut does not like the idea of an assembler producing extra instructions behind my back. That's what compilers are for.
But in cases like this it makes a lot of sense. Nobody wants to dick around figuring out how to split immediates up for loading. And I guess it's not much more of a worry than an Intel assembler producing huge sequences of instruction and operand bytes whose length depends on the actual values and addressing modes you use.
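For the record, the splitting itself is not hard, just fiddly. Here is a minimal Python sketch of how a 32-bit constant can be divided between LUI (upper 20 bits) and ADDI (lower 12 bits). The catch is that ADDI sign-extends its immediate, so when bit 11 is set the upper part has to be bumped by one. The function names are mine, and this is not lifted from any real assembler.

```python
def split_li(value):
    """Split a 32-bit constant into (upper20, lower12) for a LUI + ADDI pair."""
    value &= 0xFFFFFFFF
    lo = value & 0xFFF
    if lo >= 0x800:
        lo -= 0x1000                      # ADDI sign-extends its 12-bit immediate
    hi = ((value - lo) >> 12) & 0xFFFFF   # compensates for a negative lo
    return hi, lo


def combine(hi, lo):
    """What the core ends up with: (hi << 12) + lo, modulo 2**32."""
    return ((hi << 12) + lo) & 0xFFFFFFFF


if __name__ == "__main__":
    for constant in (0xFFFF0006, 0x12345FFF):
        hi, lo = split_li(constant)
        print(f"li a5, 0x{constant:08x}  ->  lui a5, 0x{hi:05x} ; addi a5, a5, {lo}")
        assert combine(hi, lo) == constant
```

As far as I know a real assembler will also emit a single instruction when the constant is small enough or has its low 12 bits clear, which is exactly why the instruction count can change when a literal changes.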
Next up I have to turn on the RISC V compressed code feature and see what space savings we get.
http://icoboard.org/
http://icoboard.org/risc-v.html
Which points us to icoSoC which runs on that board:
https://github.com/cliffordwolf/icotools/tree/master/icosoc
Which contains a copy of the picorv32 core:
https://github.com/cliffordwolf/icotools/blob/master/icosoc/common/picorv32.v
So I guess if you follow the installation instructions in that repo you end up with a RISC V SoC for iCE40.
I love that tidbit about getting around the 4K limit.
Honestly I think this whole IceStorm thing is huge. I mean, we can now develop for FPGA in Verilog using totally Open Source tools, even running on a Raspberry Pi. That is a monumental achievement. I'm surprised I have not seen any talk of it on the Raspi forums.
Soon it's goodbye clunky Quartus for me !
All in all not something the RISC-V designers or people writing assemblers for it worry about. RISC-V is intended as a general purpose instruction set architecture.
Also the Propeller 2 does that (with AUGx).
Load Immediate (LI) is not a native RISC-V instruction, it's an assembler pseudo instruction to simplify the load of constants, just like ## on the P2.
You can write: if you want to see every instruction.
The big difference between a Propeller (1 or 2) and RISC-V is in the tight integration of counters and ports with the instructions on the Propeller. This allows fast bitbanged software peripherals which are much harder on RISC-V.
On RISC-V the ports are normally memory mapped which needs separate instructions for Load Modify and Store.
What is just an XOR OUTA,#1 on the Propeller, becomes: on RISC-V. And Load/Store are often one of the slower instructions.
Same for things like WAITCNT or WAITPNE.
So if you want a Propeller like multicore with RISC-V cores, you will need custom instructions that allow tight integration with ports and counters.
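The load, modify, store point can be sketched with a toy model in Python. This is not RISC-V code, only a count of the bus traffic needed to flip one pin when the port is memory mapped. The `Bus` class and the `GPIO_OUT` address are hypothetical.

```python
class Bus:
    """Toy memory-mapped bus that counts loads and stores."""

    def __init__(self):
        self.mem = {}
        self.loads = 0
        self.stores = 0

    def load(self, addr):
        self.loads += 1
        return self.mem.get(addr, 0)

    def store(self, addr, value):
        self.stores += 1
        self.mem[addr] = value & 0xFFFFFFFF


GPIO_OUT = 0xFFFF0000  # hypothetical address of the output port register


def toggle_pin(bus, pin):
    """Flip one output pin the memory-mapped way: three separate steps."""
    value = bus.load(GPIO_OUT)   # load the current port value
    value ^= 1 << pin            # modify it in a CPU register
    bus.store(GPIO_OUT, value)   # store it back to the port


if __name__ == "__main__":
    bus = Bus()
    toggle_pin(bus, 0)
    print(f"port = {bus.mem[GPIO_OUT]:#x}, loads = {bus.loads}, stores = {bus.stores}")
```

Every toggle costs a load and a store on top of the XOR, where the Propeller does the whole thing in one instruction on a register that is the port.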
Andy
That's not the issue though. The issue is simply extra instructions being generated by the assembler that you did not explicitly write. Which complicates simple minded instruction counting when making tight bit banging loops and so on. Also if you increase the size of a literal all of a sudden your code gets bigger!
Anyway, I'm not inclined to worry about that much. My RISC-V will be in FPGA, unless someone starts selling actual RISC-V chips, so any such bit banging will be done in Verilog!
Certainly the tight integration of I/O into the COG instruction set is a wonderful thing.
I was pondering the idea of RISC-V extensions for such bit banging and timing. The picorv32 core has a coprocessor interface for exactly that purpose. Currently only used for the optional MUL and DIV instructions. I was starting to wonder how easy it might be to add my own instructions to that interface for ports and counters etc.
RISC-V is like being stuck in HUBEXEC. Several of them would be like several COGS stuck in HUBEXEC.
This isn't a bad thing! It's just not tuned for real-time, embedded, sense, process, response like we are aiming for.
Imagine a RISC-V with tons more registers. Now, imagine running code out of the registers themselves. That's a COG.
If I imagine a RISCV with tons of registers and executing from regs, it's not a RISC V anymore!
Horses for courses and all that.
If you haven't done so yet, you should take a look at another project he has made: Yosys (Yosys Open SYnthesis Suite)
http://www.clifford.at/yosys/
Next step is to download OpenRAM (an open memory compiler) and then you have almost 95% of what you need to make a P1 in 0.5um (FreePDK45 / SCN3ME_SUBM)
https://github.com/mguthaus/OpenRAM
OpenRAM was released around 6 months ago.
This post is from 2 years ago, when openRAM was not released yet: https://www.reddit.com/r/yosys/comments/2g426s/readmemh_support/
On a Propeller, they are one and the same. It's a register when you want it to be, small, local memory when you want it to be.
As for need, a Prop is a memory to memory direct design at the COG level. Code and Data are unified, registers / memory, etc... This means avoiding the load / store cycle, which improves throughput and real-time response.
At the HUB level, a Prop is a load-store machine, just having a ton of registers.
One distinction is the I/O is memory mapped, but in the same space as the COG memory is, or it's dedicated, accessed by implied addressing.
This makes it a micro-controller, in my view, as that generally isn't the model used for general purpose computing. This also makes it very fast in terms of sense, process, response too.
Finally, we have some shared resources, like CORDIC, PRNG. Most things are relative to a COG, cloned to maximize both throughput and real time.
Should we get it done this year, it's going to be distinctive. Capable of things at a process and clock speed that is hard to beat.
Perhaps I'll get to looking at Yosys sometime. Just now I'm still feeling my way around Verilog itself.
Yosys is verilog too (same as Quartus or Vivado). You can do both.
http://www.clifford.at/yosys/files/yosys_presentation.pdf
Are you curious about how xoroshiro128+ can be implemented with simple TTL logic gates? You can use the xoroshiro128+ code as input for Yosys and it will show you the logic gates needed to implement it. There is even one command that will show you a graph of that.
He made some automated scripts to compare the output of his verilog program (yosys) with Quartus and Vivado : http://www.clifford.at/yosys/vloghammer.html
And found many bugs in both Yosys and commercial tools.
Completely mind blowing !
ArachnePNR is the place and route tool of IceStorm, which decides which LUTs are used and routes them correctly according to the netlist. The output is the configuration bitstream.
It's mainly the place and route that takes so long in Quartus and other commercial tools. Arachne is blindingly fast in comparison, at the cost of a bigger LUT count.
But the supported ICE40 FPGAs are quite limited. Only the bitstreams of older ICE40 types are known. They have max 8k LUTs and no Multipliers.
So don't expect you can use it for a P1V bigger than 4 cogs with some custom peripherals.
The supported FPGAs may not be so new or big, but if it's enough to put a 32 bit RISC V core onto with a usable memory space and room spare for some custom logic, that is very useful.
Why would one need 16 cores of P2 when the functions you want to create can be done in Verilog on a super cheap FPGA?
What with open source tools that run on a Raspberry Pi we might see 10 year olds knocking out their own logic designs!
I might be checking out this Yosys tool chain sooner than I thought, thanks Ramon for the little push there.