I quickly hacked something together to try it out. Here are some size results, but I had to edit memory.v to reduce the memory from 48 to 12 kbytes. And multiple files to get rid of the tristate bus. (And I just disabled the PLL)
Based on the time that it takes to run this on an old x86 laptop I think that the Pi is out as a vehicle for this magnitude of hardware development. (The simulation and waveform viewing is fine though.) It would be interesting to know how long it takes on a Surface since Quartus was able to do it in a minute. This is an old x86 MacBook - I think the first x86 laptop that Apple produced.
... but Lattice add a large white arrow on the silkscreen in case you overlook their part
That's definitely a chuckle right there.
I used to work for Weitek, and some companies would advertise their own systems in various trade publications, and include a shot of the PCB. Eventually some of them started blurring the Weitek chips because they were too prominent. Good free advertising while it lasted. They were BIG chips.
They also have RISC-V development boards (with a real RISC-V ASIC, not an FPGA).
I can't find a price, or full data, on the FE310 chip, (or the larger U500) but it mentions a few things....
"The FE310 features SiFive’s E31 CPU Coreplex, a 32-bit RV32IMAC core running at 320+ MHz. Additional features include a 16KB L1 Instruction Cache, a 16KB Data SRAM scratchpad, hardware multiply/divide, debug module, one-time programmable non-volatile memory (OTP), flexible clock generation with on-chip oscillators and PLLs, and a wide variety of peripherals including UARTs, QSPI, PWMs and timers. Multiple power domains and a low-power standby mode ensure a variety of applications can benefit from the FE310, which was fabricated in TSMC 180nm.
... a dedicated off-chip Quad-SPI flash controller for execute-in-place, 8 KiB of in-circuit programmable OTP memory, 8 KiB of mask ROM,
"
That 180nm process is the same as P2, right (but at OnSemi, not TSMC for P2) ?
The good news there is if 180nm can give 320MHz, maybe 160MHz for P2 is not going to slip down ?
XIP sounds good too..
Heater - there's some Verilog and scripts in there as well, so you can compare your approach with Wolf's. There's even an "On-chip logic analyzer" and a simple SPI master.
Edited to add: to look at this it's best to build an example. e.g. in icosoc/examples/hello run make testbench_vcd. You see this error:
icosoc.v:51: error: NULL port declarations are not allowed.
icosoc.mk:106: recipe for target 'testbench' failed
This is a typical trailing comma error in Verilog. Just edit icosoc.mk to eliminate this trailing comma.
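For anyone who hasn't run into this before, here is a rough sketch of what a trailing comma in a port list looks like and how one could hunt for it. The Verilog fragment and the helper names are made up for illustration (they are not taken from icosoc), and the regex knows nothing about comments or strings, so treat it as a quick check only. Python is used just to keep the sketch easy to run.

```python
import re

# Hypothetical Verilog fragment, NOT the real icosoc.v: the last port is
# followed by a comma, which leaves an empty (NULL) port before ")".
VERILOG_SNIPPET = """\
module top (
    input  clk,
    output led,
);
endmodule
"""

# A comma followed only by whitespace and then a closing parenthesis.
TRAILING_COMMA = re.compile(r",(\s*)\)")


def find_trailing_commas(source):
    """Return the 1-based line numbers of commas that directly precede ')'."""
    return [source.count("\n", 0, m.start()) + 1
            for m in TRAILING_COMMA.finditer(source)]


def strip_trailing_commas(source):
    """Drop each offending comma but keep the whitespace and the ')'."""
    return TRAILING_COMMA.sub(r"\1)", source)


print("trailing comma on line(s):", find_trailing_commas(VERILOG_SNIPPET))
print(strip_trailing_commas(VERILOG_SNIPPET))
```

If the offending file is generated, the real fix belongs in whatever generates it, otherwise the comma comes back on the next build.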
The example in the scripts/icestorm subdir seems to be tailored for small parts. I don't think that it would fit in the 1k LUT version:
Design Information
Command line: map -a MachXO2 -p LCMXO2-7000HE -t TQFP144 -s 4 -oc Commercial picorv_picorv.ngd -o picorv_picorv_map.ncd -pr picorv_picorv.prf -mp picorv_picorv.mrp -lpf C:/02_Elektronik/044_RISCV/07_PICORV/02_Implementations/02_MachXO2/picorv/picorv_picorv_synplify.lpf -lpf C:/02_Elektronik/044_RISCV/07_PICORV/02_Implementations/02_MachXO2/picorv.lpf -c 0 -gui
Target Vendor: LATTICE
Target Device: LCMXO2-7000HETQFP144
Target Performance: 4
Mapper: xo2c00, version: Diamond (64-bit) 3.6.0.83.4
Mapped on: 04/15/17 06:55:06
Design Summary
Number of registers: 757 out of 7209 (11%)
PFU registers: 749 out of 6864 (11%)
PIO registers: 8 out of 345 (2%)
Number of SLICEs: 854 out of 3432 (25%)
SLICEs as Logic/ROM: 710 out of 3432 (21%)
SLICEs as RAM: 144 out of 2574 (6%)
SLICEs as Carry: 125 out of 3432 (4%)
Number of LUT4s: 1672 out of 6864 (24%)
Number used as logic LUTs: 1134
Number used as distributed RAM: 288
Number used as ripple logic: 250
Number used as shift registers: 0
Number of PIO sites used: 9 + 4(JTAG) out of 115 (11%)
Number of block RAMs: 5 out of 26 (19%)
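As a quick sanity check on how the numbers in that summary hang together, here is a small Python sketch. The figures are copied from the report above; the variable and function names are mine. The LUT4 sub-categories should add up to the LUT4 count, and the PFU plus PIO registers to the register count.

```python
# Figures copied from the Design Summary above.
LUT4_AVAILABLE = 6864
REGISTERS_AVAILABLE = 7209

lut4_used = {
    "logic": 1134,
    "distributed RAM": 288,
    "ripple logic": 250,
    "shift registers": 0,
}
registers_used = {"PFU": 749, "PIO": 8}


def percent(used, available):
    """Utilization rounded to a whole percent, as the report prints it."""
    return round(100 * used / available)


lut4_total = sum(lut4_used.values())
register_total = sum(registers_used.values())

print(f"LUT4s: {lut4_total} ({percent(lut4_total, LUT4_AVAILABLE)}%)")
print(f"Registers: {register_total} "
      f"({percent(register_total, REGISTERS_AVAILABLE)}%)")
```

So roughly a third of the LUT4s in this build go to distributed RAM and carry logic rather than plain logic.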
Thanks for the heads up on the Memory Model Verification paper. TL;DR most of it but a quick scan shows they found problems with the memory model/atomics specification of RISC V. Luckily I don't think the atomics ISA extension is finalized yet so there is time to fix bugs or tighten up ambiguities in the spec.
Boy that is very complicated. What with juggling caches, pipelines, instruction reordering, shared memory, etc.
My brushes with multi-processor bugs were child's play by comparison.
Back in the early 1980s I worked on a project that had multiple Intel 8085s sharing memory. The whole memory sharing thing had been implemented in TTL. I was the software guy but found myself debugging intermittent failures with scope and LSA. Turned out they were violating timing on asserting HOLD to the CPU. This only caused problems with chips made in one Intel FAB, others were OK! Adding a flop or two fixed that.
Later the Marconi company had developed their own 16 bit processor using AMD 2900 bit slice chips. Again intermittent problems struck our software. Again it's out with the scope and LSA. Turned out their LOCK EXCHANGE instruction did not work. Oh yes, said the hardware team, locks are not implemented on the version of processor you have there, have these new cards.
Grrr...
I sometimes daydream of putting multiple picorv32 on the DE0 nano. No shared memory. I want them hooked up with hardware comms channels.
Thanks Chip, and others, for the comments on SPI.
I want to keep this as simple as possible so as to work with the ADC on the DE0 Nano. No slave mode needed. I also don't want to write a slave mode emulation for a test bench.
I'm going for a simple test bench that will check that enable, clock and data are coming out as per the DE0 manual's description. It can also inject some data bits at the right times to see if they get clocked in nicely.
Anyway, that is on hold till I get my Makefile and C plus newlib building working properly.
With the MUL, DIV and barrel shifter left out the whole build comes to 8% of the nano. So eight cores should fit. They would be a bit tight for RAM though.
Wow! That's quite a few. Sounds like I should be firing up my DE0 Nano! :-)
How much savings do you see leaving those out in the Cyclone IV? I was wondering if multiplication might be more efficiently implemented there than in the Lattice iCE40 parts.
Total number of logic levels: 32
Total path delay: 15.33 ns (65.22 MHz)
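The MHz figure there is just the reciprocal of the path delay. A tiny Python sketch, with a helper name of my own choosing; it can differ from the report in the last digit, presumably because the tool works from the unrounded delay.

```python
def fmax_mhz(path_delay_ns):
    """Maximum clock frequency in MHz for a critical path given in ns."""
    return 1000.0 / path_delay_ns


# Path delay taken from the timing report above.
print(f"{fmax_mhz(15.33):.2f} MHz")
```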
The Lattice breakout board seems like the cheapest way to get started (~$40 USD) with a decent sized part. Other boards do have more features, but are at least twice as expensive. It would be nice to add some NV storage to the breakout board.
Just wanted to mention that the iCE40 internal nonvolatile configuration memory is one-time programmable. So the breakout board's docs don't even mention using it.
Edited to add: if my calculations are correct the SPI FLASH on the breakout board is ~30X larger than needed. I wonder if there's a way to access it from the FPGA fabric? The iCE40 Programming and Configuration guide lists the required bitstream sizes.
Here is what I am seeing on the nano (In order of removing things) :
Everything: 2717 LE, 1276 regs, 12%
No barrel shifter: 2564 LE, 1219 regs, 11%
No div: 2143 LE, 1019 regs, 10%
No mul: 1964 LE, 928 regs, 9%
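To put numbers on the savings, here is a small Python sketch that works out the LEs saved at each step and re-derives the percentages. It assumes the Nano's Cyclone IV is the EP4CE22 with 22,320 LEs, which is not stated above, so correct that constant if your part differs. The function names are mine.

```python
# Assumption (not stated in the post): the DE0-Nano carries an EP4CE22
# with 22,320 logic elements.
TOTAL_LES = 22320

# LE counts from the list above, in the order things were removed.
BUILDS = [
    ("Everything", 2717),
    ("No barrel shifter", 2564),
    ("No div", 2143),
    ("No mul", 1964),
]


def utilization_percent(les, total=TOTAL_LES):
    """Utilization rounded to a whole percent."""
    return round(100 * les / total)


def step_savings(builds):
    """LEs saved by each step relative to the build just before it."""
    return [(name, prev - les)
            for (_, prev), (name, les) in zip(builds, builds[1:])]


for name, les in BUILDS:
    print(f"{name}: {les} LE, {utilization_percent(les)}%")
for name, saved in step_savings(BUILDS):
    print(f"{name}: saves {saved} LE")
```

On these figures the divider is the biggest single saving, with the multiplier and barrel shifter each worth well under half of that.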
Sadly the "No mul" option stops my "Hello world!" code from running. Not sure why, as far as I can tell it does not use a MUL anywhere!
KeithE,
Sadly the "No mul" option stops my "Hello world!" code from running. Not sure why, as far as I can tell it does not use a MUL anywhere!
Did you use the right compiler? The one in the directory riscv32i (no m or c in the name)? Not that it explains why code without a multiplication wouldn't build correctly.
For fun I tried using one of the ice40 SPI pins as an LED pin, and the ice storm tools didn't complain. If I look in the pinout spreadsheet, then I need to include those 4 pins to get to the maximum of 206 that's listed in the family selection guide table. So maybe the included SPI flash could be used. Do I want to shell out $40 to know for sure...?
Edited to add: I don't know how I missed this before, but in the iCE40 Programming and Configuration guide it clearly states:
After configuration, the SPI port pins are available to the user-application as additional PIO pins, supplied by the VCC_SPI input voltage.
So this seems nice. If you put enough boot code into the config file, then you can probably use this built-in SPI FLASH to get multiple megabytes of storage. No expansion board necessary.
Those are probably double function pins, after configuration they can be used as user IOs.
I tried replacing the tristate bus with a wired-or bus as Chip suggested. It raised the Fmax to 57 MHz (from about 54). I was hoping for bigger savings.
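For scale, a quick Python sketch of what that change amounts to. The 54 MHz starting point is only approximate, and the helper names are mine.

```python
def period_ns(fmax_mhz):
    """Clock period in ns for a frequency in MHz."""
    return 1000.0 / fmax_mhz


def gain_percent(old_mhz, new_mhz):
    """Relative Fmax improvement in percent."""
    return 100.0 * (new_mhz - old_mhz) / old_mhz


OLD_MHZ = 54.0  # approximate starting point
NEW_MHZ = 57.0

saved_ns = period_ns(OLD_MHZ) - period_ns(NEW_MHZ)
print(f"gain: {gain_percent(OLD_MHZ, NEW_MHZ):.1f}%")
print(f"critical path shortened by about {saved_ns:.2f} ns")
```

So the bus change trims roughly a nanosecond off the critical path, which suggests the tristate bus was not the dominant part of it.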
Now the CPU build without MUL works again for "Hello World".
So you've always been using riscv32i? Then it makes no sense to me at all.
Also if it's really easy you might try instantiating multiple xoro_tops and see how that affects the Quartus runtime. At some utilization you'll probably hit the hockey stick part of the curve.
Here's something crazy that someone could do. Put picorv32 and a J1a in the same image. Write a RISC V assembler in SwapForth, allow the J1a to poke into the RISC V RAM. You would probably want one of the built-in "logic analyzers" as well to trace execution ;-)
I'm new to the game but I suspected that as you fill up the FPGA all that placing and routing gets harder and harder and takes longer and longer as it tries to squeeze things in.
I won't be trying any such experiments until I have my simple plans working. A proper main() "Hello world" being first on the list.
For sure I am not going anywhere near that Forth gibberish!
That low run time is pretty cool - what utilization have you hit?
I have a DE0-Nano-SoC board in the garage which is a nice upgrade to that board. It's a Cyclone V instead of IV so this comparison could be a little off. For 25% more money it's got about 1.8X as many LEs and 4X as much embedded RAM. Oh - and a dual-core ARM running at 925 MHz which can boot linux from an SD card and has Ethernet. I did some little designs in it which interfaced to the ARMs, so my C code running on them could talk to the logic. I would have to resurrect my linux VM with all of those tools on it and remember how they work. The fabric can also access the SDRAM controller, so there's potentially a lot of off-chip RAM available. It would be cool if the RISC V tools (or Propeller 2!) could be run right on the embedded ARMs. But I always used cross compilation to build software for the ARMs. (Relatively easy to follow recipe)
The one thing about that DE0-Nano-SoC board is the Cyclone V gets hot even when not doing much. A friend has one as well and experiences the same thing. I wouldn't feel comfortable running it without a heatsink. (It doesn't come with one.)
KeithE,
Anyway, adding all that Linux complexity, with build tools running in VMs, etc, is really something I want to avoid. Been there done that.
That board can be an "educational" experience - you get your money's worth. Fortunately there are a lot of somewhat aging demos and documentation. But you'll get exposure to Das-UBoot, an obscure linux "distro", having to map physical memory to user space, using QSys to set up interconnect between the FPGA fabric and the hard processor system (one little dot can mean a bunch of logic including multi-bit clock domain crossing), the AXI bus (you can easily convert to APB which is simple, or whatever Altera calls the other option), cross compilation, and probably stuff that I forgot about. One user wrote up some stuff here - https://digibird1.wordpress.com/playing-with-the-cyclone-v-soc-system-de0-nano-soc-kitatlas-soc/
Comments
Not yet sure how to get a timing report.
Disabling BARREL_SHIFTER And disabling ENABLE_DIV And disabling ENABLE_FAST_MUL - the big payoff. And ENABLE_PCPI - why not?
Anyways - all of this helps a lot with runtime, and leaves room for logic outside of the CPU. I'll have to give this a shot on the Pi after all.
Edited to add: then if I run: I get: For the even simpler example script that's included in picorv32 I get:
http://icoboard.org/get-started-with-your-icoboard-and-a-raspi.html
and:
https://github.com/cliffordwolf/icotools
Edited to add: to look at this it's best to build an example, e.g. in icosoc/examples/hello run make testbench_vcd. This fails with a typical trailing comma error in Verilog. Edit icosoc.mk to eliminate the trailing comma, run make again, and you can see "Hello World!" printed out. I just aborted the simulation and looked in icosoc.v to see how everything is wired.
If anyone finds a Lattice tool-flow version for iCE40 I could try fitting that into the 5280 LUT part.
The iCE40 LUTs seem not quite the same as MachXO LUTs.
You would also need to add a QuadSPI / HyperFLASH XIP block for the smallest pin count parts.
And perfectly working code for the MachXO2 synthesized with synplify doesn't always synthesize with synplify for the iCE series, no idea why
http://www.excamera.com/sphinx/article-j1a-swapforth.html
Breaking news: "the first commit that enables MAX10 and Cyclone IV is already on the HEAD of the Yosys repo"
https://www.reddit.com/r/yosys/comments/64huh2/announcing_synthesis_for_intel_fpga_in_yosys_alpha/
https://github.com/cliffordwolf/yosys/blob/master/techlibs/altera_intel/synth_intel.cc
There is a flow included with picorv32. I probably missed something given you said 5280, but here are results for the 7680 LC part, the HX8K.
picorv32/scripts/icestorm $ make example.bin
picorv32/scripts/icestorm $ icetime -tmd hx8k example.asc
Timing results: 32 logic levels, total path delay 15.33 ns (65.22 MHz).
The commands that compile the "hello world" are :
Now the CPU build without MUL works again for "Hello World".
So:
All changes checked in.
The whole point here is to replace ARM with RISC V.
Obviously my picorv32 experiment is very far away from that.
Anyway, adding all that Linux complexity, with build tools running in VMs, etc, is really something I want to avoid. Been there done that.
I think, if you want to mix FPGA with Linux then hooking the Raspberry Pi to an FPGA is the better way to go. Enter the Icoboard!
On my list of things to do is SDRAM access on the Nano...we will see.
Edited to add: for anyone who really wants exposure to this stuff, Rocketboards is a nice resource. https://rocketboards.org/foswiki/view/Documentation/AVCVGSRD161
The ICO Board looks cool, though the 8-Mbit version isn't in stock. ~$100 though. Too bad Lattice doesn't subsidize this board.
No, I don't want to go there.
I have had my fill of boot loaders and obscure Linux distros. I have even built some of my own.
https://buildroot.org/
http://www.linuxfromscratch.org/
And the weird idea some vendors of ARM boards have that you need a VM image of their Linux + dev tools to develop for their boards.
Ick.
I want to plough my own simple furrow with Verilog on my Nano. Hopefully on Lattice and whatever else later.
One day I may catch up with AXI bus or Wishbone or whatever.