RISC V ?


Comments

  • KeithE Posts: 910
    edited April 2017
    I quickly hacked something together to try it out. Here are some size results, but I had to edit memory.v to reduce the memory from 48 to 12 kbytes, and edit multiple files to get rid of the tristate bus. (I also just disabled the PLL.)
    IOs          15 / 206
    GBs          0 / 8
      GB_IOs     0 / 8
    LCs          7164 / 7680
      DFF        1091
      CARRY      531
      CARRY, DFF 213
      DFF PASS   423
      CARRY PASS 17
    BRAMs        28 / 32
    WARMBOOTs    0 / 1
    PLLs         0 / 2
    ...
    After placement:
    PIOs       18 / 206
    PLBs       952 / 960
    BRAMs      28 / 32
    

    Not yet sure how to get a timing report.

    Based on the time that it takes to run this on an old x86 laptop, I think that the Pi is out as a vehicle for this magnitude of hardware development. (The simulation and waveform viewing is fine though.) It would be interesting to know how long it takes on a Surface, since Quartus was able to do it in a minute. This is an old x86 MacBook - I think the first x86 laptop that Apple produced.
  • jmg wrote: »
    ... but lattice add a large white arrow on the silkscreen in case you overlook their part :)
    That's definitely a chuckle right there.
    "Are we alone in the universe?"
    "Yes," said the Oracle.
    "So there's no other life out there?"
    "There is. They're alone too."
  • evanh wrote: »
    jmg wrote: »
    ... but lattice add a large white arrow on the silkscreen in case you overlook their part :)
    That's definitely a chuckle right there.
    I used to work for Weitek, and some companies would advertise their own systems in various trade publications, and include a shot of the PCB. Eventually some of them started blurring the Weitek chips because they were too prominent. Good free advertising while it lasted. They were BIG chips.
  • KeithE Posts: 910
    edited April 2017
    I didn't want to give up too easily, so I tried disabling a few things. Looks like it helps a lot.

    Disabling BARREL_SHIFTER
    After packing:
    LCs          6805 / 7680
      DFF        1070
      CARRY      542
      CARRY, DFF 209
      DFF PASS   407
      CARRY PASS 21
    ...
    After placement:
    PLBs       945 / 960
    
    And disabling ENABLE_DIV
    After packing:
    LCs          6355 / 7680
      DFF        866
      CARRY      324
      CARRY, DFF 213
      DFF PASS   341
      CARRY PASS 16
    ...
    After placement:
    PLBs       919 / 960
    
    And disabling ENABLE_FAST_MUL - the big payoff.
    After packing:
    LCs          2759 / 7680
      DFF        772
      CARRY      323
      CARRY, DFF 156
      DFF PASS   256
      CARRY PASS 16
    ...
    After placement:
    PLBs       490 / 960
    
    And disabling ENABLE_PCPI - why not?
    After packing:
    IOs          15 / 206
    GBs          0 / 8
      GB_IOs     0 / 8
    LCs          2743 / 7680
      DFF        772
      CARRY      292
      CARRY, DFF 156
      DFF PASS   255
      CARRY PASS 16
    ...
    After placement:
    PLBs       475 / 960
    

    Anyways - all of this helps a lot with runtime, and leaves room for logic outside of the CPU. I'll have to give this a shot on the Pi after all.
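
For anyone who wants to see the trend at a glance, here is a minimal Python sketch that turns the "After packing" LC counts quoted above into utilization percentages and per-step savings. The numbers are copied from the reports in this post; the labels and function names are made up for illustration, and it assumes each option was disabled on top of the previous ones, as the post describes.

```python
# Logic-cell (LC) counts after packing, copied from the reports above.
# 7680 is the device total printed by the tool for the HX8K.
TOTAL_LCS = 7680

# Each entry is cumulative: options are disabled on top of the previous build.
builds = [
    ("everything enabled", 7164),
    ("no BARREL_SHIFTER", 6805),
    ("no ENABLE_DIV", 6355),
    ("no ENABLE_FAST_MUL", 2759),
    ("no ENABLE_PCPI", 2743),
]

def utilization(lcs, total=TOTAL_LCS):
    """Return LC utilization as a percentage of the device."""
    return 100.0 * lcs / total

def savings(steps):
    """Return (label, LCs saved relative to the previous build) pairs."""
    return [(label, prev - cur)
            for (_, prev), (label, cur) in zip(steps, steps[1:])]

for label, lcs in builds:
    print(f"{label:22s} {lcs:5d} LCs  {utilization(lcs):5.1f}%")
for label, saved in savings(builds):
    print(f"{label:22s} saves {saved} LCs")
```

The fast multiplier accounts for by far the largest single drop in this set of builds.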

    Edited to add: then if I run:
    icetime -tmd hx8k example.asc
    
    I get:
    Total path delay: 24.68 ns (40.52 MHz)
    
    For the even simpler example script that's included in picorv32 I get:
    Total number of logic levels: 32
    Total path delay: 15.33 ns (65.22 MHz)
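
The MHz figure in these reports is just the reciprocal of the total path delay. A small Python sketch of the conversion, with a hypothetical helper name; the last digit can differ slightly from the tool's figure, presumably because the printed delay is already rounded.

```python
def fmax_mhz(path_delay_ns):
    """Convert a critical-path delay in nanoseconds to a clock rate in MHz."""
    if path_delay_ns <= 0:
        raise ValueError("path delay must be positive")
    # 1 / (delay in ns) gives GHz, so scale by 1000 to get MHz.
    return 1000.0 / path_delay_ns

# Delays taken from the two reports quoted above.
for delay in (24.68, 15.33):
    print(f"{delay:.2f} ns -> {fmax_mhz(delay):.2f} MHz")
```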
    
  • ersmith wrote: »
    They also have RISC-V development boards (with a real RISC-V ASIC, not an FPGA).

    I can't find a price, or full data, on the FE310 chip (or the larger U500), but it mentions a few things....

    "The FE310 features SiFive’s E31 CPU Coreplex, a 32-bit RV32IMAC core running at 320+ MHz. Additional features include a 16KB L1 Instruction Cache, a 16KB Data SRAM scratchpad, hardware multiply/divide, debug module, one-time programmable non-volatile memory (OTP), flexible clock generation with on-chip oscillators and PLLs, and a wide variety of peripherals including UARTs, QSPI, PWMs and timers. Multiple power domains and a low-power standby mode ensure a variety of applications can benefit from the FE310, which was fabricated in TSMC 180nm.
    ... a dedicated off-chip Quad-SPI flash controller for execute-in-place, 8 KiB of in-circuit programmable OTP memory, 8 KiB of mask ROM,
    "

    That 180nm process is the same as P2, right (but at OnSemi, not TSMC for P2) ?
    The good news there is if 180nm can give 320MHz, maybe 160MHz for P2 is not going to slip down ?
    XIP sounds good too..
  • KeithE Posts: 910
    edited April 2017
    Looks like the icoboard guys have a Pi image with the tools already built. Why didn't I look there earlier?

    http://icoboard.org/get-started-with-your-icoboard-and-a-raspi.html
    and:
    https://github.com/cliffordwolf/icotools

    Heater - there's some Verilog and scripts in there as well, so you can compare your approach with Wolf's. There's even an "On-chip logic analyzer" and a simple SPI master.

    Edited to add: to look at this it's best to build an example. e.g. in icosoc/examples/hello run make testbench_vcd. You see this error:
    icosoc.v:51: error: NULL port declarations are not allowed.
    icosoc.mk:106: recipe for target 'testbench' failed
    
    This is a typical trailing comma error in Verilog. Just edit icosoc.v to eliminate the trailing comma at the end of line 51:
    51:    output SRAM_CE, SRAM_WE, SRAM_OE, SRAM_LB, SRAM_UB, HRAM_CK,
    52 );
    
    Run make again, and you can see "Hello World!" printed out. I just aborted the simulation and looked in icosoc.v to see how everything is wired.
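
If the file gets regenerated and the comma comes back, the same fix can be scripted. This is a rough Python sketch, not part of icotools: the function name is hypothetical, and the regex is crude (it knows nothing about comments or strings), so treat it as an illustration of the edit rather than a robust tool.

```python
import re

def strip_trailing_port_comma(verilog_text):
    """Remove a comma that directly precedes a closing ')'.

    This targets port lists that end in ',' followed by ');', which is
    what triggers the 'NULL port declarations are not allowed' error.
    """
    return re.sub(r",(\s*\))", r"\1", verilog_text)

# A shortened stand-in for the offending lines of the generated file.
broken = "    output SRAM_UB, HRAM_CK,\n);\n"
print(strip_trailing_port_comma(broken))
```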
  • The example in scripts icestorm subdir seems to be tailored for small parts. I don't think that it would fit in the 1 k LUTs version:
    Design Information
    
    Command line:   map -a MachXO2 -p LCMXO2-7000HE -t TQFP144 -s 4 -oc Commercial
         picorv_picorv.ngd -o picorv_picorv_map.ncd -pr picorv_picorv.prf -mp
         picorv_picorv.mrp -lpf C:/02_Elektronik/044_RISCV/07_PICORV/02_Implementati
         ons/02_MachXO2/picorv/picorv_picorv_synplify.lpf -lpf C:/02_Elektronik/044_
         RISCV/07_PICORV/02_Implementations/02_MachXO2/picorv.lpf -c 0 -gui 
    Target Vendor:  LATTICE
    Target Device:  LCMXO2-7000HETQFP144
    Target Performance:   4
    Mapper:  xo2c00,  version:  Diamond (64-bit) 3.6.0.83.4
    Mapped on:  04/15/17  06:55:06
    
    
    Design Summary
       Number of registers:    757 out of  7209 (11%)
          PFU registers:          749 out of  6864 (11%)
          PIO registers:            8 out of   345 (2%)
       Number of SLICEs:       854 out of  3432 (25%)
          SLICEs as Logic/ROM:    710 out of  3432 (21%)
          SLICEs as RAM:          144 out of  2574 (6%)
          SLICEs as Carry:        125 out of  3432 (4%)
       Number of LUT4s:        1672 out of  6864 (24%)
          Number used as logic LUTs:        1134
          Number used as distributed RAM:   288
          Number used as ripple logic:      250
          Number used as shift registers:     0
       Number of PIO sites used: 9 + 4(JTAG) out of 115 (11%)
       Number of block RAMs:  5 out of 26 (19%)
    
    
  • Ale wrote: »
    The example in scripts icestorm subdir seems to be tailored for small parts. I don't think that it would fit in the 1 k LUTs version:

    If anyone finds a Lattice tool-flow version for iCE40 I could try that into the 5280 LUT part.
    The iCE40 LUTs seem not quite the same as MachXO LUTs.

    You would also need to add a QuadSPI / HyperFLASH XIP block for the smallest pin count parts.

  • The iCE40 LUTs seem not quite the same as MachXO LUTs.

    And perfectly working code for the MachXO2 synthesized with synplify doesn't always synthesize with synplify for the iCE series, no idea why :(
  • Heater. Posts: 20,939
    edited April 2017
    KeithE,

    Thanks for the heads up on the Memory Model Verification paper. TL;DR most of it, but a quick scan shows they found problems with the memory model/atomics specification of RISC V. Luckily I don't think the atomics ISA extension is finalized yet, so there is time to fix bugs or tighten up ambiguities in the spec.

    Boy that is very complicated. What with juggling caches, pipelines, instruction reordering, shared memory, etc.

    My brushes with multi-processor bugs were child's play by comparison.

    Back in the early 1980s I worked on a project that had multiple Intel 8085s sharing memory. The whole memory sharing thing had been implemented in TTL. I was the software guy but found myself debugging intermittent failures with scope and LSA. Turned out they were violating timing on asserting HOLD to the CPU. This only caused problems with chips made in one Intel fab, others were OK! Adding a flop or two fixed that.

    Later the Marconi company had developed their own 16 bit processor using AMD 2900 bit slice chips. Again intermittent problems struck our software. Again it's out with the scope and LSA. Turned out their LOCK EXCHANGE instruction did not work. Oh yes, said the hardware team, locks are not implemented on the version of processor you have there, have these new cards.
    Grrr...

    I sometimes day dream of putting multiple picorv32 on the DE0 nano. No shared memory. I want them hooked up with hardware comms channels.

  • Thanks, Chip and others, for the comments on SPI.

    I want to keep this as simple as possible so as to work with the ADC on the DE0 Nano. No slave mode needed. I also don't want to write a slave mode emulation for a test bench.

    I'm going for a simple test bench that will check that enable, clock and data are coming out as per the DE0 manual's description. It can also inject some data bits at the right times to see if they get clocked in nicely.

    Anyway, that is on hold till I get my Makefile and C plus newlib build working properly.
  • Heater. wrote: »
    KeithE,

    Thanks for the heads up on the Memory Model Verification paper. TL;DR most of it, but a quick scan shows they found problems with the memory model/atomics specification of RISC V. Luckily I don't think the atomics ISA extension is finalized yet, so there is time to fix bugs or tighten up ambiguities in the spec.

    Boy that is very complicated. What with juggling caches, pipelines, instruction reordering, shared memory, etc.

    My brushes with multi-processor bugs were child's play by comparison.

    Back in the early 1980s I worked on a project that had multiple Intel 8085s sharing memory. The whole memory sharing thing had been implemented in TTL. I was the software guy but found myself debugging intermittent failures with scope and LSA. Turned out they were violating timing on asserting HOLD to the CPU. This only caused problems with chips made in one Intel fab, others were OK! Adding a flop or two fixed that.

    Later the Marconi company had developed their own 16 bit processor using AMD 2900 bit slice chips. Again intermittent problems struck our software. Again it's out with the scope and LSA. Turned out their LOCK EXCHANGE instruction did not work. Oh yes, said the hardware team, locks are not implemented on the version of processor you have there, have these new cards.
    Grrr...

    I sometimes day dream of putting multiple picorv32 on the DE0 nano. No shared memory. I want them hooked up with hardware comms channels.
    How many will fit on a DE0?

  • With the MUL, DIV and barrel shifter left out the whole build comes to 8% of the nano. So eight cores should fit. They would be a bit tight for RAM though.
  • Heater. wrote: »
    With the MUL, DIV and barrel shifter left out the whole build comes to 8% of the nano. So eight cores should fit. They would be a bit tight for RAM though.
    Wow! That's quite a few. Sounds like I should be firing up my DE0 Nano! :-)

  • Heater. wrote: »
    With the MUL, DIV and barrel shifter left out the whole build comes to 8% of the nano. So eight cores should fit. They would be a bit tight for RAM though.
    How much savings do you see leaving those out in the Cyclone IV? I was wondering if multiplication might be more efficiently implemented there than in the Lattice iCE40 parts.
    Ale wrote:
    The example in scripts icestorm subdir seems to be tailored for small parts. I don't think that it would fit in the 1 k LUTs version
    Even the 16-bit J1a forth CPU is a tight fit for the 1k.

    http://www.excamera.com/sphinx/article-j1a-swapforth.html
  • Ramon wrote: »
    Yosys can map to ASIC (standard cell) and also other FPGAs (ice40, Xilinx-7, gowin, and greenpak)

    Breaking news: "the first commit that enables MAX10 and Cyclone IV is already on the HEAD of the Yosys repo"

    https://www.reddit.com/r/yosys/comments/64huh2/announcing_synthesis_for_intel_fpga_in_yosys_alpha/
    https://github.com/cliffordwolf/yosys/blob/master/techlibs/altera_intel/synth_intel.cc
  • KeithE Posts: 910
    edited April 2017
    jmg wrote: »
    If anyone finds a Lattice tool-flow version for iCE40 I could try that into the 5280 LUT part.
    The iCE40 LUTs seem not quite the same as MachXO LUTs.

    You would also need to add a QuadSPI / HyperFLASH XIP block for the smallest pin count parts.

    There is a flow included with picorv32. I probably missed something given you said 5280, but here are results for the 7680 LC part, the HX8K.

    picorv32/scripts/icestorm $ make example.bin
    picorv32/scripts/icestorm $ icetime -tmd hx8k example.asc
    After packing:
    IOs          9 / 206
    GBs          0 / 8
      GB_IOs     0 / 8
    LCs          1982 / 7680
      DFF        635
      CARRY      280
      CARRY, DFF 5
      DFF PASS   209
      CARRY PASS 7
    BRAMs        6 / 32
    WARMBOOTs    0 / 1
    PLLs         0 / 2
    ...
    After placement:
    PIOs       13 / 206
    PLBs       366 / 960
    BRAMs      6 / 32
    
    Timing results:
    Total number of logic levels: 32
    Total path delay: 15.33 ns (65.22 MHz)
    
    The Lattice breakout board seems like the cheapest way to get started (~$40 USD) with a decent sized part. Other boards do have more features, but are at least twice as expensive. It would be nice to add some NV storage to the breakout board.
  • KeithE Posts: 910
    edited April 2017
    KeithE wrote: »
    The Lattice breakout board seems like the cheapest way to get started (~$40 USD) with a decent sized part. Other boards do have more features, but are at least twice as expensive. It would be nice to add some NV storage to the breakout board.
    Just wanted to mention that the iCE40 internal nonvolatile configuration memory is one-time programmable. So the breakout board's docs don't even mention using it.

    Edited to add: if my calculations are correct the SPI FLASH on the breakout board is ~30X larger than needed. I wonder if there's a way to access it from the FPGA fabric? The iCE40 Programming and Configuration guide lists the required bitstream sizes.
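
The "~30X" estimate is simply flash capacity divided by bitstream size. Here is a minimal Python sketch of that arithmetic. Both example values are assumptions for illustration (a 32 Mbit flash and a bitstream of roughly 135 KB), not figures taken from this thread; check the board documentation and the iCE40 Programming and Configuration guide for the real numbers.

```python
def flash_to_bitstream_ratio(flash_bytes, bitstream_bytes):
    """Return how many bitstream-sized images would fit in the flash."""
    if bitstream_bytes <= 0:
        raise ValueError("bitstream size must be positive")
    return flash_bytes / bitstream_bytes

# ASSUMPTION: 32 Mbit SPI flash on the board (verify against the board docs).
FLASH_BYTES = 32 * 1024 * 1024 // 8
# ASSUMPTION: approximate HX8K bitstream size in bytes (verify in the guide).
BITSTREAM_BYTES = 135_100

ratio = flash_to_bitstream_ratio(FLASH_BYTES, BITSTREAM_BYTES)
print(f"flash is about {ratio:.0f}x the bitstream size")
print(f"left over for user data: {FLASH_BYTES - BITSTREAM_BYTES} bytes")
```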
  • Heater. Posts: 20,939
    edited April 2017
    KeithE,
    How much savings do you see leaving those out in the Cyclone IV? I was wondering if multiplication might be more efficiently implemented there than in the Lattice iCE40 parts.
    Here is what I am seeing on the nano (in order of removing things):
    Everything:        2717 LE, 1276 regs, 12%
    No barrel shifter: 2564 LE, 1219 regs, 11%
    No div:            2143 LE, 1019 regs, 10%
    No mul:            1964 LE,  928 regs,  9% 
    

    Sadly the "No mul" option stops my "Hello world!" code from running. Not sure why; as far as I can tell it does not use a MUL anywhere!
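
To put these numbers next to the iCE40 ones posted earlier in the thread, here is a small Python sketch comparing how much of the full build the multiplier step accounts for on each part. The counts are copied from the two posts; the function name is hypothetical. LCs and LEs are not the same unit, and the two builds may not use identical multiplier options, so this is only a rough comparison.

```python
def share_saved(before, after, full_build):
    """Percentage of the full build removed by one configuration step."""
    return 100.0 * (before - after) / full_build

# iCE40 HX8K (LCs after packing): full build 7164, multiplier step 6355 -> 2759.
ice40 = share_saved(6355, 2759, 7164)
# Cyclone IV on the nano (LEs): full build 2717, multiplier step 2143 -> 1964.
cyclone = share_saved(2143, 1964, 2717)

print(f"iCE40 multiplier step:      {ice40:.1f}% of the full build")
print(f"Cyclone IV multiplier step: {cyclone:.1f}% of the full build")
```

The Cyclone IV has embedded hardware multipliers, which would be consistent with the much smaller saving there, though the thread does not confirm how Quartus mapped the multiply.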


  • KeithE Posts: 910
    edited April 2017
    Heater. wrote: »
    KeithE,
    Sadly the "No mul" option stops my "Hello world!" code from running. Not sure why, as far as I can tell it does not use a MUL anywhere!
    Did you use the right compiler? The one in the directory riscv32i (no m or c in the name)? Not that it explains why code without a multiplication wouldn't build correctly.

    For fun I tried using one of the iCE40 SPI pins as an LED pin, and the IceStorm tools didn't complain. If I look in the pinout spreadsheet, then I need to include those 4 pins to get to the maximum of 206 that's listed in the family selection guide table. So maybe the included SPI flash could be used. Do I want to shell out $40 to know for sure...?

    Edited to add: I don't know how I missed this before, but in the iCE40 Programming and Configuration guide it clearly states:
    After configuration, the SPI port pins are available to the user-application as additional PIO pins, supplied by the VCC_SPI input voltage.
    So this seems nice. If you put enough boot code into the config file, then you can probably use this built-in SPI FLASH to get multiple megabytes of storage. No expansion board necessary.
  • Those are probably double function pins, after configuration they can be used as user IOs.

    I tried replacing the tristate bus with a wired-OR bus as Chip suggested. It raised the Fmax to 57 MHz (from about 54). I was hoping for bigger savings.
  • Hmm..my toolchain is installed in /opt/riscv32i/bin


    The commands that compile the "hello world" are :
    $ make firmware/firmware.hex
    /opt/riscv32i/bin/riscv32-unknown-elf-gcc -c -march=rv32i -Os --std=c99 -Wall -Wextra -Wshadow -Wundef -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings -Wredundant-decls -Wstrict-prototypes -Wmissing-prototypes -pedantic  -ffreestanding -nostdlib -o firmware/helloWorld.o firmware/helloWorld.c
    /opt/riscv32i/bin/riscv32-unknown-elf-gcc -Os -ffreestanding -nostdlib -o firmware/firmware.elf \
                    -Wl,-Bstatic,-T,firmware/sections.lds,-Map,firmware/firmware.map,--strip-debug \
                    firmware/start.o firmware/irq.o firmware/print.o firmware/sieve.o firmware/helloWorld.o firmware/multest.o firmware/stats.o  -lgcc
    chmod -x firmware/firmware.elf
    /opt/riscv32i/bin/riscv32-unknown-elf-objcopy -O binary firmware/firmware.elf firmware/firmware.bin
    chmod -x firmware/firmware.bin
    python3 firmware/makehex.py firmware/firmware.bin 12288 firmware > firmware/firmware.hex
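
For readers wondering what the last step does: it turns the raw binary into a text file of 32-bit hex words that the Verilog memory can load. The following Python sketch shows the general idea only. It is not the actual makehex.py used here, and it assumes little-endian 32-bit words, zero padding up to the memory size, and that the numeric argument is a word count.

```python
def bin_to_hex_words(data, nwords):
    """Convert raw bytes to a list of 8-digit hex strings, one per 32-bit word.

    ASSUMPTIONS: little-endian words, image zero-padded to nwords words.
    """
    if len(data) > 4 * nwords:
        raise ValueError("image does not fit in the requested memory size")
    # Pad with zero bytes so every word in the memory gets a value.
    padded = data + bytes(4 * nwords - len(data))
    return ["%08x" % int.from_bytes(padded[i:i + 4], "little")
            for i in range(0, 4 * nwords, 4)]

# Tiny made-up image: one full word and one partial word.
image = bytes([0x13, 0x00, 0x00, 0x00, 0xEF, 0xBE])
for word in bin_to_hex_words(image, 3):
    print(word)
```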
    
  • Hmm...I removed running the sieve test from the firmware.

    Now the CPU build without MUL works again for "Hello World".

    So:
    No mul:            1957 LE,  926 regs,  9% 
    

    All changes checked in.
  • Heater. wrote: »
    Anyway, that is on hold till I get my Makefile and C plus newlib build working properly.
    Do any of the icoboard examples do this? I'll be curious to see what you come up with.
    heater wrote:
    Now the CPU build without MUL works again for "Hello World".
    So you've always been using riscv32i? Then it makes no sense to me at all.

    Also if it's really easy you might try instantiating multiple xoro_tops and see how that affects the Quartus runtime. At some utilization you'll probably hit the hockey stick part of the curve.

    Here's something crazy that someone could do. Put picorv32 and a J1a in the same image. Write a RISC V assembler in SwapForth, allow the J1a to poke into the RISC V RAM. You would probably want one of the built-in "logic analyzers" as well to trace execution ;-)
  • KeithE,

    I'm new to the game but I suspected that as you fill up the FPGA all that placing and routing gets harder and harder and takes longer and longer as it tries to squeeze things in.

    I won't be trying any such experiments until I have my simple plans working. A proper main() "Hello world" being first on the list.

    For sure I am not going anywhere near that Forth gibberish!

  • I don't think a DE0-Nano compile will ever take more than 6 minutes on an iX7 machine.
  • KeithE Posts: 910
    edited April 2017
    That low run time is pretty cool - what utilization have you hit?

    I have a DE0-Nano-SoC board in the garage which is a nice upgrade to that board. It's a Cyclone V instead of IV so this comparison could be a little off. For 25% more money it's got about 1.8X as many LEs and 4X as much embedded RAM. Oh - and a dual-core ARM running at 925 MHz which can boot linux from an SD card and has Ethernet. I did some little designs in it which interfaced to the ARMs, so my C code running on them could talk to the logic. I would have to resurrect my linux VM with all of those tools on it and remember how they work. The fabric can also access the SDRAM controller, so there's potentially a lot of off-chip RAM available. It would be cool if the RISC V tools (or Propeller 2!) could be run right on the embedded ARMs. But I always used cross compilation to build software for the ARMs. (Relatively easy to follow recipe)

    The one thing about that DE0-Nano-SoC board is the Cyclone V gets hot even when not doing much. A friend has one as well and experiences the same thing. I wouldn't feel comfortable running it without a heatsink. (It doesn't come with one.)
  • KeithE,
    Oh - and a dual-core ARM running at 925 MHz
    Noooo.....

    The whole point here is to replace ARM with RISC V.

    Obviously my picorv32 experiment is very far away from that.

    Anyway, adding all that Linux complexity, with build tools running in VMs, etc, is really something I want to avoid. Been there done that.

    I think, if you want to mix FPGA with Linux then hooking the Raspberry Pi to an FPGA is the better way to go. Enter the Icoboard!

    On my list of things to do is SDRAM access on the Nano...we will see.

  • KeithE Posts: 910
    edited April 2017
    Heater. wrote: »
    KeithE,
    Anyway, adding all that Linux complexity, with build tools running in VMs, etc, is really something I want to avoid. Been there done that.
    That board can be an "educational" experience - you get your money's worth. Fortunately there are a lot of somewhat aging demos and documentation. But you'll get exposure to Das U-Boot, an obscure Linux "distro", having to map physical memory to user space, using Qsys to set up the interconnect between the FPGA fabric and the hard processor system (one little dot can mean a bunch of logic including multi-bit clock domain crossing), the AXI bus (you can easily convert to APB which is simple, or whatever Altera calls the other option), cross compilation, and probably stuff that I forgot about. One user wrote up some stuff here - https://digibird1.wordpress.com/playing-with-the-cyclone-v-soc-system-de0-nano-soc-kitatlas-soc/

    Edited to add: for anyone who really wants exposure to this stuff, Rocketboards is a nice resource. https://rocketboards.org/foswiki/view/Documentation/AVCVGSRD161

    The ICO Board looks cool, though the 8-Mbit version isn't in stock. ~$100 though. Too bad Lattice doesn't subsidize this board.
  • Ouch, my eyes!

    No, I don't want to go there.

    I have had my fill of boot loaders and obscure Linux distros. I have even built some of my own.
    https://buildroot.org/
    http://www.linuxfromscratch.org/

    And vendors of ARM boards have some weird idea that you need a VM image of their Linux + dev tools to develop for their boards.

    Ick.

    I want to plough my own simple furrow with Verilog on my Nano. Hopefully on Lattice and whatever else later.

    One day I may catch up with AXI bus or Wishbone or whatever.


