RISC V ?

KeithE · 2017-04-14 19:31

I quickly hacked something together to try it out. Here are some size results, but I had to edit memory.v to reduce the memory from 48 to 12 kbytes, and edit multiple files to get rid of the tristate bus. (And I just disabled the PLL.)

IOs          15 / 206
GBs          0 / 8
  GB_IOs     0 / 8
LCs          7164 / 7680
  DFF        1091
  CARRY      531
  CARRY, DFF 213
  DFF PASS   423
  CARRY PASS 17
BRAMs        28 / 32
WARMBOOTs    0 / 1
PLLs         0 / 2
...
After placement:
PIOs       18 / 206
PLBs       952 / 960
BRAMs      28 / 32

Not yet sure how to get a timing report.

Based on the time that it takes to run this on an old x86 laptop, I think that the Pi is out as a vehicle for this magnitude of hardware development. (The simulation and waveform viewing is fine though.) It would be interesting to know how long it takes on a Surface, since Quartus was able to do it in a minute. This is an old x86 MacBook - I think the first x86 laptop that Apple produced.

evanh · 2017-04-14 22:27

jmg wrote: »

... but lattice add a large white arrow on the silkscreen in case you overlook their part

That's definitely a chuckle right there.

KeithE · 2017-04-14 22:54

evanh wrote: »

jmg wrote: »

... but lattice add a large white arrow on the silkscreen in case you overlook their part

That's definitely a chuckle right there.

I used to work for Weitek, and some companies would advertise their own systems in various trade publications, and include a shot of the PCB. Eventually some of them started blurring the Weitek chips because they were too prominent. Good free advertising while it lasted. They were BIG chips.

KeithE · 2017-04-15 00:32

I didn't want to give up too easily, so I tried disabling a few things. Looks like it helps a lot.

Disabling BARREL_SHIFTER

After packing:
LCs          6805 / 7680
  DFF        1070
  CARRY      542
  CARRY, DFF 209
  DFF PASS   407
  CARRY PASS 21
...
After placement:
PLBs       945 / 960

And disabling ENABLE_DIV

After packing:
LCs          6355 / 7680
  DFF        866
  CARRY      324
  CARRY, DFF 213
  DFF PASS   341
  CARRY PASS 16
...
After placement:
PLBs       919 / 960

And disabling ENABLE_FAST_MUL - the big payoff.

After packing:
LCs          2759 / 7680
  DFF        772
  CARRY      323
  CARRY, DFF 156
  DFF PASS   256
  CARRY PASS 16
...
After placement:
PLBs       490 / 960

And disabling ENABLE_PCPI - why not?

After packing:
IOs          15 / 206
GBs          0 / 8
  GB_IOs     0 / 8
LCs          2743 / 7680
  DFF        772
  CARRY      292
  CARRY, DFF 156
  DFF PASS   255
  CARRY PASS 16
...
After placement:
PLBs       475 / 960

Anyways - all of this helps a lot with runtime, and leaves room for logic outside of the CPU. I'll have to give this a shot on the Pi after all.
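To put the reports above side by side, here is a minimal sketch that turns the LC counts into utilization percentages and per-step savings. The numbers are copied from the reports in this post; the assumption is that each step disables one more option on top of the previous ones, and the function name is made up for illustration.

```python
# LC counts copied from the reports above (iCE40 HX8K, 7680 LCs total).
# Assumption: the options were disabled cumulatively, in this order.
TOTAL_LCS = 7680
steps = [
    ("baseline", 7164),
    ("no BARREL_SHIFTER", 6805),
    ("no ENABLE_DIV", 6355),
    ("no ENABLE_FAST_MUL", 2759),
    ("no ENABLE_PCPI", 2743),
]

def utilization_table(steps, total):
    """Return (label, lcs, percent_used, lcs_saved_vs_previous_step) rows."""
    rows = []
    previous = None
    for label, lcs in steps:
        saved = 0 if previous is None else previous - lcs
        rows.append((label, lcs, round(100.0 * lcs / total, 1), saved))
        previous = lcs
    return rows

for label, lcs, pct, saved in utilization_table(steps, TOTAL_LCS):
    print(f"{label:20s} {lcs:5d} LCs  {pct:5.1f}%  saved {saved}")
```

The fast multiplier accounts for 3596 of the LCs on its own, which is why dropping it takes the design from about 83% to about 36% of the part.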

Edited to add: then if I run:

icetime -tmd hx8k example.asc

I get:

Total path delay: 24.68 ns (40.52 MHz)

For the even simpler example script that's included in picorv32 I get:

Total number of logic levels: 32
Total path delay: 15.33 ns (65.22 MHz)
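The MHz figure in these reports is just the reciprocal of the critical path delay. A small sketch of that arithmetic (the function name is hypothetical, not part of icetime):

```python
def fmax_mhz(path_delay_ns):
    """Maximum clock frequency in MHz for a critical path delay given in ns."""
    return 1000.0 / path_delay_ns

# Path delays taken from the two timing reports above.
for delay in (24.68, 15.33):
    print(f"{delay} ns -> {fmax_mhz(delay):.2f} MHz")
```

Recomputing from the printed delay can differ from the reported frequency in the last digit, presumably because the printed delay is itself rounded.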

jmg · 2017-04-15 01:09

ersmith wrote: »

They also have RISC-V development boards (with a real RISC-V ASIC, not an FPGA).

I can't find a price, or full data, on the FE310 chip, (or the larger U500) but it mentions a few things....

"The FE310 features SiFive’s E31 CPU Coreplex, a 32-bit RV32IMAC core running at 320+ MHz. Additional features include a 16KB L1 Instruction Cache, a 16KB Data SRAM scratchpad, hardware multiply/divide, debug module, one-time programmable non-volatile memory (OTP), flexible clock generation with on-chip oscillators and PLLs, and a wide variety of peripherals including UARTs, QSPI, PWMs and timers. Multiple power domains and a low-power standby mode ensure a variety of applications can benefit from the FE310, which was fabricated in TSMC 180nm.
... a dedicated off-chip Quad-SPI flash controller for execute-in-place, 8 KiB of in-circuit programmable OTP memory, 8 KiB of mask ROM,
"

That 180nm process is the same as P2, right (but at OnSemi, not TSMC for P2) ?
The good news there is if 180nm can give 320MHz, maybe 160MHz for P2 is not going to slip down ?
XIP sounds good too..

KeithE · 2017-04-15 01:53

Looks like the icoboard guys have a Pi image with the tools already built. Why didn't I look there earlier?

http://icoboard.org/get-started-with-your-icoboard-and-a-raspi.html
and:
https://github.com/cliffordwolf/icotools

Heater - there's some Verilog and scripts in there as well, so you can compare your approach with Wolf's. There's even an "On-chip logic analyzer" and a simple SPI master.

Edited to add: to look at this it's best to build an example. e.g. in icosoc/examples/hello run make testbench_vcd. You see this error:

icosoc.v:51: error: NULL port declarations are not allowed.
icosoc.mk:106: recipe for target 'testbench' failed

This is a typical trailing comma error in Verilog. Just edit icosoc.v to eliminate this trailing comma:

51:    output SRAM_CE, SRAM_WE, SRAM_OE, SRAM_LB, SRAM_UB, HRAM_CK,
52 );

Run make again, and you can see "Hello World!" printed out. I just aborted the simulation and looked in icosoc.v to see how everything is wired.
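If hand-editing gets tedious, the same fix can be scripted. This is a naive sketch, not part of icotools: it removes any comma that sits directly before a closing parenthesis, ignoring whitespace. It does not understand comments or strings, so treat it as a quick hack and diff the result.

```python
import re

# A comma followed only by whitespace/newlines before a closing parenthesis.
TRAILING_COMMA = re.compile(r",(\s*)\)")

def strip_trailing_commas(verilog_text):
    """Remove a comma that directly precedes ')' and keep the whitespace."""
    return TRAILING_COMMA.sub(r"\1)", verilog_text)

# The offending port list from the error above.
sample = "    output SRAM_CE, SRAM_WE, SRAM_OE, SRAM_LB, SRAM_UB, HRAM_CK,\n);"
print(strip_trailing_commas(sample))
```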

Ale · 2017-04-15 05:00

The example in scripts icestorm subdir seems to be tailored for small parts. I don't think that it would fit in the 1 k LUTs version:

Design Information

Command line:   map -a MachXO2 -p LCMXO2-7000HE -t TQFP144 -s 4 -oc Commercial
     picorv_picorv.ngd -o picorv_picorv_map.ncd -pr picorv_picorv.prf -mp
     picorv_picorv.mrp -lpf C:/02_Elektronik/044_RISCV/07_PICORV/02_Implementati
     ons/02_MachXO2/picorv/picorv_picorv_synplify.lpf -lpf C:/02_Elektronik/044_
     RISCV/07_PICORV/02_Implementations/02_MachXO2/picorv.lpf -c 0 -gui 
Target Vendor:  LATTICE
Target Device:  LCMXO2-7000HETQFP144
Target Performance:   4
Mapper:  xo2c00,  version:  Diamond (64-bit) 3.6.0.83.4
Mapped on:  04/15/17  06:55:06


Design Summary
   Number of registers:    757 out of  7209 (11%)
      PFU registers:          749 out of  6864 (11%)
      PIO registers:            8 out of   345 (2%)
   Number of SLICEs:       854 out of  3432 (25%)
      SLICEs as Logic/ROM:    710 out of  3432 (21%)
      SLICEs as RAM:          144 out of  2574 (6%)
      SLICEs as Carry:        125 out of  3432 (4%)
   Number of LUT4s:        1672 out of  6864 (24%)
      Number used as logic LUTs:        1134
      Number used as distributed RAM:   288
      Number used as ripple logic:      250
      Number used as shift registers:     0
   Number of PIO sites used: 9 + 4(JTAG) out of 115 (11%)
   Number of block RAMs:  5 out of 26 (19%)

jmg · 2017-04-15 06:46

Ale wrote: »

The example in scripts icestorm subdir seems to be tailored for small parts. I don't think that it would fit in the 1 k LUTs version:

If anyone finds a Lattice tool-flow version for iCE40 I could try that into the 5280 LUT part.
The iCE40 LUTs seem not quite the same as MachXO LUTs.

You would also need to add a QuadSPI / HyperFLASH XIP block for the smallest pin count parts.

Ale · 2017-04-15 07:37

The iCE40 LUTs seem not quite the same as MachXO LUTs.

And perfectly working code for the MachXO2 synthesized with Synplify doesn't always synthesize with Synplify for the iCE series, no idea why.

Heater. · 2017-04-15 11:51

KeithE,

Thanks for the heads up on the Memory Model Verification paper. TL;DR most of it but a quick scan shows they found problems with the memory model/atomics specification of RISC V. Luckily I don't think the atomics ISA extension is finalized yet so there is time to fix bugs or tighten up ambiguities in the spec.

Boy that is very complicated. What with juggling caches, pipelines, instruction reordering, shared memory, etc.

My brushes with multi-processor bugs were child's play by comparison.

Back in the early 1980s I worked on a project that had multiple Intel 8085s sharing memory. The whole memory sharing thing had been implemented in TTL. I was the software guy but found myself debugging intermittent failures with scope and LSA. Turned out they were violating timing on asserting HOLD to the CPU. This only caused problems with chips made in one Intel fab, others were OK! Adding a flop or two fixed that.

Later the Marconi company had developed their own 16 bit processor using AMD 2900 bit slice chips. Again intermittent problems struck our software. Again it's out with the scope and LSA. Turned out their LOCK EXCHANGE instruction did not work. Oh yes, said the hardware team, locks are not implemented on the version of processor you have there, have these new cards.
Grrr...

I sometimes day dream of putting multiple picorv32 on the DE0 nano. No shared memory. I want them hooked up with hardware comms channels.

Heater. · 2017-04-15 12:13

Thanks, Chip and others, for the comments on SPI.

I want to keep this as simple as possible so as to work with the ADC on the DE0 Nano. No slave mode needed. I also don't want to write a slave mode emulation for a test bench.

I'm going for a simple test bench that will check that enable, clock and data are coming out as per the DE0 manual's description. It can also inject some data bits at the right times to see if they get clocked in nicely.

Anyway, that is on hold till I get my Makefile and C plus newlib build working properly.

David Betz · 2017-04-15 13:19

Heater. wrote: »

KeithE,

Thanks for the heads up on the Memory Model Verification paper. TL;DR most of it but a quick scan shows they found problems with the memory model/atomics specification of RISC V. Luckily I don't think the atomics ISA extension is finalized yet so there is time to fix bugs or tighten up ambiguities in the spec.

Boy that is very complicated. What with juggling caches, pipelines, instruction reordering, shared memory, etc.

My brushes with multi-processor bugs were child's play by comparison.

Back in the early 1980s I worked on a project that had multiple Intel 8085s sharing memory. The whole memory sharing thing had been implemented in TTL. I was the software guy but found myself debugging intermittent failures with scope and LSA. Turned out they were violating timing on asserting HOLD to the CPU. This only caused problems with chips made in one Intel fab, others were OK! Adding a flop or two fixed that.

Later the Marconi company had developed their own 16 bit processor using AMD 2900 bit slice chips. Again intermittent problems struck our software. Again it's out with the scope and LSA. Turned out their LOCK EXCHANGE instruction did not work. Oh yes, said the hardware team, locks are not implemented on the version of processor you have there, have these new cards.
Grrr...

I sometimes day dream of putting multiple picorv32 on the DE0 nano. No shared memory. I want them hooked up with hardware comms channels.

How many will fit on a DE0?

Heater. · 2017-04-15 14:15

With the MUL, DIV and barrel shifter left out the whole build comes to 8% of the nano. So eight cores should fit. They would be a bit tight for RAM though.

David Betz · 2017-04-15 14:43

Heater. wrote: »

With the MUL, DIV and barrel shifter left out the whole build comes to 8% of the nano. So eight cores should fit. They would be a bit tight for RAM though.

Wow! That's quite a few. Sounds like I should be firing up my DE0 Nano! :-)

KeithE · 2017-04-15 15:03

Heater. wrote: »

With the MUL, DIV and barrel shifter left out the whole build comes to 8% of the nano. So eight cores should fit. They would be a bit tight for RAM though.

How much savings do you see leaving those out in the Cyclone IV? I was wondering if multiplication might be more efficiently implemented there than in the Lattice iCE40 parts.

Ale wrote:

The example in scripts icestorm subdir seems to be tailored for small parts. I don't think that it would fit in the 1 k LUTs version

Even the 16-bit J1a forth CPU is a tight fit for the 1k.

http://www.excamera.com/sphinx/article-j1a-swapforth.html

Ramon · 2017-04-15 15:17

Ramon wrote: »

Yosys can map to ASIC (standard cell) and also other FPGAs (ice40, Xilinx-7, gowin, and greenpak)

Breaking news: "the first commit that enables MAX10 and Cyclone IV is already on the HEAD of the Yosys repo"

https://www.reddit.com/r/yosys/comments/64huh2/announcing_synthesis_for_intel_fpga_in_yosys_alpha/
https://github.com/cliffordwolf/yosys/blob/master/techlibs/altera_intel/synth_intel.cc

KeithE · 2017-04-15 15:20

jmg wrote: »

If anyone finds a Lattice tool-flow version for iCE40 I could try that into the 5280 LUT part.
The iCE40 LUTs seem not quite the same as MachXO LUTs.

You would also need to add a QuadSPI / HyperFLASH XIP block for the smallest pin count parts.

There is a flow included with picorv32. I probably missed something given you said 5280, but here are results for the 7680 LC part, the HX8K.

picorv32/scripts/icestorm $ make example.bin
picorv32/scripts/icestorm $ icetime -tmd hx8k example.asc

After packing:
IOs          9 / 206
GBs          0 / 8
  GB_IOs     0 / 8
LCs          1982 / 7680
  DFF        635
  CARRY      280
  CARRY, DFF 5
  DFF PASS   209
  CARRY PASS 7
BRAMs        6 / 32
WARMBOOTs    0 / 1
PLLs         0 / 2
...
After placement:
PIOs       13 / 206
PLBs       366 / 960
BRAMs      6 / 32

Timing results:

Total number of logic levels: 32
Total path delay: 15.33 ns (65.22 MHz)

The Lattice breakout board seems like the cheapest way to get started (~$40 USD) with a decent sized part. Other boards do have more features, but are at least twice as expensive. It would be nice to add some NV storage to the breakout board.

KeithE · 2017-04-15 15:35

KeithE wrote: »

The Lattice breakout board seems like the cheapest way to get started (~$40 USD) with a decent sized part. Other boards do have more features, but are at least twice as expensive. It would be nice to add some NV storage to the breakout board.

Just wanted to mention that the iCE40 internal nonvolatile configuration memory is one-time programmable. So the breakout board's docs don't even mention using it.

Edited to add: if my calculations are correct the SPI FLASH on the breakout board is ~30X larger than needed. I wonder if there's a way to access it from the FPGA fabric? The iCE40 Programming and Configuration guide lists the required bitstream sizes.
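The "~30X" estimate can be reproduced with a couple of lines. Both figures below are assumptions and are not given in this thread: a 32 Mbit SPI flash on the breakout board and an HX8K bitstream of roughly 135,100 bytes. Check the board documentation and the Programming and Configuration guide for the real values before relying on them.

```python
def flash_headroom(flash_bits, bitstream_bits):
    """Return (ratio, spare_bytes) for a flash holding a single bitstream."""
    ratio = flash_bits / bitstream_bits
    spare_bytes = (flash_bits - bitstream_bits) // 8
    return ratio, spare_bytes

# ASSUMED values, not taken from the post: verify against the vendor docs.
FLASH_BITS = 32 * 1024 * 1024      # 32 Mbit SPI flash
BITSTREAM_BITS = 135_100 * 8       # approximate HX8K bitstream size

ratio, spare = flash_headroom(FLASH_BITS, BITSTREAM_BITS)
print(f"flash is about {ratio:.0f}x the bitstream, about {spare / 2**20:.1f} MiB spare")
```

Under those assumptions the ratio comes out close to the ~30X quoted above, with several megabytes left over after the configuration image.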

Heater. · 2017-04-15 18:04

KeithE,

How much savings do you see leaving those out in the Cyclone IV? I was wondering if multiplication might be more efficiently implemented there than in the Lattice iCE40 parts.

Here is what I am seeing on the nano (In order of removing things) :

Everything:        2717 LE, 1276 regs, 12%
No barrel shifter: 2564 LE, 1219 regs, 11%
No div:            2143 LE, 1019 regs, 10%
No mul:            1964 LE,  928 regs,  9%
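For reference, the percentages in that table can be recomputed from the LE counts. The device size is an assumption here: the sketch takes the DE0-Nano's Cyclone IV to be the EP4CE22 with 22,320 logic elements, which is not stated in this thread.

```python
# ASSUMPTION: DE0-Nano FPGA is a Cyclone IV EP4CE22 with 22,320 LEs.
DEVICE_LES = 22_320

# (configuration, logic elements, registers) copied from the table above.
builds = [
    ("Everything", 2717, 1276),
    ("No barrel shifter", 2564, 1219),
    ("No div", 2143, 1019),
    ("No mul", 1964, 928),
]

def le_report(builds, device_les):
    """Return (name, percent_of_device, les_saved_vs_first_row) rows."""
    baseline = builds[0][1]
    return [
        (name, round(100 * les / device_les), baseline - les)
        for name, les, _regs in builds
    ]

for name, pct, saved in le_report(builds, DEVICE_LES):
    print(f"{name:18s} {pct:3d}%  {saved:4d} LEs saved")
```

On those numbers the three options together save 753 LEs, roughly 28% of the full build, far less than the drop seen on the iCE40 when the fast multiplier is removed.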

Sadly the "No mul" option stops my "Hello world!" code from running. Not sure why, as far as I can tell it does not use a MUL anywhere!

KeithE · 2017-04-15 18:16

Heater. wrote: »

KeithE,
Sadly the "No mul" option stops my "Hello world!" code from running. Not sure why, as far as I can tell it does not use a MUL anywhere!

Did you use the right compiler? The one in the directory riscv32i (no m or c in the name)? Not that it explains why code without a multiplication wouldn't build correctly.

For fun I tried using one of the ice40 SPI pins as an LED pin, and the ice storm tools didn't complain. If I look in the pinout spreadsheet, then I need to include those 4 pins to get to the maximum of 206 that's listed in the family selection guide table. So maybe the included SPI flash could be used. Do I want to shell out $40 to know for sure...?

Edited to add: I don't know how I missed this before, but in the iCE40 Programming and Configuration guide it clearly states:

After configuration, the SPI port pins are available to the user-application as additional PIO pins, supplied by the VCC_SPI input voltage.

So this seems nice. If you put enough boot code into the config file, then you can probably use this built-in SPI FLASH to get multiple megabytes of storage. No expansion board necessary.

Ale · 2017-04-15 18:27

Those are probably double function pins, after configuration they can be used as user IOs.

I tried replacing the tristate bus with a wired-OR bus as Chip suggested. It raised the Fmax to 57 MHz (from about 54). I was hoping for bigger savings.

Heater. · 2017-04-15 18:35

Hmm..my toolchain is installed in /opt/riscv32i/bin

The commands that compile the "hello world" are :

$ make firmware/firmware.hex
/opt/riscv32i/bin/riscv32-unknown-elf-gcc -c -march=rv32i -Os --std=c99 -Wall -Wextra -Wshadow -Wundef -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings -Wredundant-decls -Wstrict-prototypes -Wmissing-prototypes -pedantic  -ffreestanding -nostdlib -o firmware/helloWorld.o firmware/helloWorld.c
/opt/riscv32i/bin/riscv32-unknown-elf-gcc -Os -ffreestanding -nostdlib -o firmware/firmware.elf \
                -Wl,-Bstatic,-T,firmware/sections.lds,-Map,firmware/firmware.map,--strip-debug \
                firmware/start.o firmware/irq.o firmware/print.o firmware/sieve.o firmware/helloWorld.o firmware/multest.o firmware/stats.o  -lgcc
chmod -x firmware/firmware.elf
/opt/riscv32i/bin/riscv32-unknown-elf-objcopy -O binary firmware/firmware.elf firmware/firmware.bin
chmod -x firmware/firmware.bin
python3 firmware/makehex.py firmware/firmware.bin 12288 firmware > firmware/firmware.hex
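The last step turns the raw binary into a hex file that the Verilog memory can load. The following is only a sketch of what a makehex-style script typically does, not the actual firmware/makehex.py from this build: it emits one little-endian 32-bit word per line and pads the rest of the memory with zeros. Treating 12288 as a word count is an assumption, and the third argument in the command above is not modeled.

```python
def bin_to_hex_words(data, nwords):
    """Return nwords hex strings, one little-endian 32-bit word per entry."""
    if len(data) % 4 != 0:
        raise ValueError("image length must be a multiple of 4 bytes")
    if len(data) > 4 * nwords:
        raise ValueError("image does not fit in the memory")
    lines = []
    for i in range(nwords):
        chunk = data[4 * i : 4 * i + 4]
        if chunk:
            # Bytes are stored least significant first in the binary.
            lines.append(f"{int.from_bytes(chunk, 'little'):08x}")
        else:
            # Past the end of the image: pad with zero words.
            lines.append("00000000")
    return lines

# Example: the RISC-V NOP (addi x0, x0, 0) is the word 0x00000013.
print(bin_to_hex_words(bytes([0x13, 0x00, 0x00, 0x00]), 2))
```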

Heater. · 2017-04-15 18:44

Hmm...I removed running the sieve test from the firmware.

Now the CPU build without MUL works again for "Hello World".

So:

No mul:            1957 LE,  926 regs,  9%

All changes checked in.

KeithE · 2017-04-15 19:02

Heater. wrote: »

Anyway, that is on hold till I get my Makefile and C plus newlib build working properly.

Do any of the icoboard examples do this? I'll be curious to see what you come up with.

Heater. wrote:

Now the CPU build without MUL works again for "Hello World".

So you've always been using riscv32i? Then it makes no sense to me at all.

Also if it's really easy you might try instantiating multiple xoro_tops and see how that affects the Quartus runtime. At some utilization you'll probably hit the hockey stick part of the curve.

Here's something crazy that someone could do. Put picorv32 and a J1a in the same image. Write a RISC V assembler in SwapForth, allow the J1a to poke into the RISC V RAM. You would probably want one of the built-in "logic analyzers" as well to trace execution ;-)

Heater. · 2017-04-15 19:40

KeithE,

I'm new to the game but I suspected that as you fill up the FPGA all that placing and routing gets harder and harder and takes longer and longer as it tries to squeeze things in.

I won't be trying any such experiments until I have my simple plans working. A proper main() "Hello world" being first on the list.

For sure I am not going anywhere near that Forth gibberish!

cgracey · 2017-04-15 23:08

I don't think a DE0-Nano compile will ever take more than 6 minutes on an iX7 machine.

KeithE · 2017-04-15 23:29

That low run time is pretty cool - what utilization have you hit?

I have a DE0-Nano-SoC board in the garage which is a nice upgrade to that board. It's a Cyclone V instead of IV so this comparison could be a little off. For 25% more money it's got about 1.8X as many LEs and 4X as much embedded RAM. Oh - and a dual-core ARM running at 925 MHz which can boot linux from an SD card and has Ethernet. I did some little designs in it which interfaced to the ARMs, so my C code running on them could talk to the logic. I would have to resurrect my linux VM with all of those tools on it and remember how they work. The fabric can also access the SDRAM controller, so there's potentially a lot of off-chip RAM available. It would be cool if the RISC V tools (or Propeller 2!) could be run right on the embedded ARMs. But I always used cross compilation to build software for the ARMs. (Relatively easy to follow recipe)

The one thing about that DE0-Nano-SoC board is the Cyclone V gets hot even when not doing much. A friend has one as well and experiences the same thing. I wouldn't feel comfortable running it without a heatsink. (It doesn't come with one.)

Heater. · 2017-04-15 23:53

KeithE,

Oh - and a dual-core ARM running at 925 MHz

Noooo.....

The whole point here is to replace ARM with RISC V.

Obviously my picorv32 experiment is very far away from that.

Anyway, adding all that Linux complexity, with build tools running in VMs, etc, is really something I want to avoid. Been there done that.

I think, if you want to mix FPGA with Linux then hooking the Raspberry Pi to an FPGA is the better way to go. Enter the Icoboard!

On my list of things to do is SDRAM access on the Nano...we will see.

KeithE · 2017-04-16 01:14

Heater. wrote: »

KeithE,
Anyway, adding all that Linux complexity, with build tools running in VMs, etc, is really something I want to avoid. Been there done that.

That board can be an "educational" experience - you get your money's worth. Fortunately there are a lot of somewhat aging demos and documentation. But you'll get exposure to Das U-Boot, an obscure Linux "distro", having to map physical memory to user space, using QSys to set up interconnect between the FPGA fabric and the hard processor system (one little dot can mean a bunch of logic including multi-bit clock domain crossing), the AXI bus (you can easily convert to APB which is simple, or whatever Altera calls the other option), cross compilation, and probably stuff that I forgot about. One user wrote up some stuff here - https://digibird1.wordpress.com/playing-with-the-cyclone-v-soc-system-de0-nano-soc-kitatlas-soc/

Edited to add: for anyone who really wants exposure to this stuff, Rocketboards is a nice resource. https://rocketboards.org/foswiki/view/Documentation/AVCVGSRD161

The ICO Board looks cool, though the 8-Mbit version isn't in stock. ~$100 though. Too bad Lattice doesn't subsidize this board.

Heater. · 2017-04-16 01:38

Ouch, my eyes!

No, I don't want to go there.

I have had my fill of boot loaders and obscure Linux distros. I have even built some of my own.
https://buildroot.org/
http://www.linuxfromscratch.org/

And then there are the weird ideas some vendors of ARM boards have, that you need a VM image of their Linux + dev tools to develop for their boards.

Ick.

I want to plough my own simple furrow with Verilog on my Nano. Hopefully on Lattice and whatever else later.

One day I may catch up with AXI bus or Wishbone or whatever.
