FPGA based soft-CPU (distant relative of COG) — Parallax Forums

nutson Posts: 242
edited 2009-02-09 21:00 in Propeller 1
For experiments with my FPGA board (DE-1, Altera 2C20) I wanted a soft-CPU that is easy to program. I have built a CPU that can execute a (very) limited set of COG instruction codes, so I can use the Propeller IDE to write assembler programs. It has a 256x32 program memory; addresses > $FF are reserved for I/O ports. All instructions take 2 clocks: 1 = fetch instruction and write back previous result, 2 = fetch data and calculate. The CPU is currently less than 100 lines of Verilog code, the propstick control and test environment is another 50 lines, and the FPGA working environment is Quartus 8.0 SP1.

A propstick SPIN program downloads the soft_CPU program, sets the clock mode, and in manual clock mode displays internal CPU states using PropTerminal (very useful program, Andy). After ironing out the logic bugs this way, I ran the thing at speed, and have already hit 80 MHz (40 MIPS) with the simple test program shown. That is far beyond expectation, so there must be some nasty timing bugs waiting for me and the logic analyser.
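The two-clock instruction cycle described above can be sketched in software. This is a toy model, not the Verilog source: the class `SoftCpuModel`, the pre-decoded `(op, dest, src)` tuples, the three-instruction subset and the helper `throughput_mips` are all assumptions made for illustration. Only the 256-word memory, the I/O region above $FF, the two clock phases and the 80 MHz figure come from the post.

```python
MEM_WORDS = 256      # 256 x 32 program memory (from the post)
IO_BASE = 0x100      # addresses above $FF are I/O ports (from the post)
MASK32 = 0xFFFFFFFF
CLOCKS_PER_INSTRUCTION = 2


class SoftCpuModel:
    """Toy model of the 2-clock instruction cycle; not the author's Verilog."""

    def __init__(self, program):
        # Instructions are kept pre-decoded as (op, dest, src) tuples for
        # readability; the real CPU stores 32-bit COG instruction words.
        self.mem = list(program) + [0] * (MEM_WORDS - len(program))
        self.io = {}           # I/O port registers, keyed by address
        self.pc = 0
        self.instr = None      # instruction latched during clock 1
        self.pending = None    # (dest, value) awaiting write-back
        self.clocks = 0

    def _read(self, addr):
        return self.io.get(addr, 0) if addr >= IO_BASE else self.mem[addr]

    def _write(self, addr, value):
        if addr >= IO_BASE:
            self.io[addr] = value & MASK32
        else:
            self.mem[addr] = value & MASK32

    def clock1(self):
        """Clock 1: write back the previous result, fetch the instruction."""
        if self.pending is not None:
            self._write(*self.pending)
            self.pending = None
        self.instr = self.mem[self.pc]
        self.pc = (self.pc + 1) % MEM_WORDS
        self.clocks += 1

    def clock2(self):
        """Clock 2: fetch the operands and calculate the result."""
        op, dest, src = self.instr
        if op == "mov":        # mov dest, #src (immediate only in this toy)
            self.pending = (dest, src)
        elif op == "add":      # add dest, src (register or I/O port)
            self.pending = (dest, self._read(dest) + self._read(src))
        elif op == "jmp":      # jmp #src
            self.pc = src
        self.clocks += 1

    def step(self):
        self.clock1()
        self.clock2()


def throughput_mips(clock_hz):
    """Instruction rate in MIPS when every instruction takes two clocks."""
    return clock_hz / CLOCKS_PER_INSTRUCTION / 1e6


program = [
    ("mov", 0x10, 5),
    ("mov", 0x11, 7),
    ("add", 0x10, 0x11),
    ("add", 0x100, 0x10),   # destination above $FF, so this hits an I/O port
    ("jmp", 0, 4),          # loop forever at address 4
]
cpu = SoftCpuModel(program)
for _ in range(5):
    cpu.step()
```

Note that the result of an instruction only lands in memory during clock 1 of the following instruction, which is why the model needs the fifth step before the I/O port write becomes visible.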

Help needed: Any experienced Verilog or Quartus users out there? I am a beginner with Quartus, and further steps are beyond my current knowledge. Who wants to cooperate and help me now with:
- optimizing speed and reducing logic cell usage;
- extending the instruction set: I need CALL/RET, indexed or indirect addressing, and some more;
- specifying the I/O structure: ideally this would be a Wishbone interface, opening up access to all the www.opencores.org IP.

Creating a fully specced COG is not my goal. I want a simple controller to perform I/O tasks, for example an FPGA DDS synthesizer with the soft-CPU performing amplitude and frequency modulation.

Drop me a PM, tell me what you want, and I will send you the code (and some documentation if I can find the time to write it in the next few days).

Nico Hattink

PS There is one clear bug showing in the single-step debug data: who is the first to see it?

Comments

  • Cluso99 Posts: 18,069
    edited 2008-11-07 12:22
    I have a Xilinx Spartan 3A by Avnet.

    If my memory is correct, Chip is designing the Prop II using an Altera Stratix III to verify the logic.


    Post Edited (Cluso99) : 11/7/2008 12:27:50 PM GMT
  • nutson Posts: 242
    edited 2008-11-07 12:40
    I am aware that Chip (Parallax) is using Altera Stratix FPGAs to develop the Propeller chips. Too bad I wasn't around when they offered the first-generation development tools for sale.

    nutson
  • Cluso99 Posts: 18,069
    edited 2008-11-07 13:39
    Maybe you can use a prop to sample the test data? I note you have some sort of logic probe.
  • Hanno Posts: 1,130
    edited 2008-11-08 04:20
    nutson-
    Great job, looks very promising...
    Is the "clear error" that the "instruct" for the first 3 steps is wrong? After step 3, the PC matches the instruction, but step 0, 1 and 2 use a different value...
    Hanno
  • nutson Posts: 242
    edited 2008-11-08 07:28
    You got it, Hanno. The soft CPU reset logic is not always functioning properly, and sometimes the first instruction fetched comes from an address other than PC=0. Otherwise the instruction sequencing is OK. It probably has to do with the fact that during reset I still have to apply a clock to the CPU, so that the Propeller can write the program memory. The changeover from this "write" clock to the "execution" clock has to be improved. Name your price.

    Cluso99: I have a good 500 MHz logic analyzer, and absolutely need this speed to eventually track down timing problems or glitches, or to analyze the worst-case timing path in this CPU, which is already clocked at 80 MHz. Quartus has many facilities for analyzing timing during simulation and execution, but I have not mastered all of these.

    Nico
  • heater Posts: 3,370
    edited 2008-11-08 11:23
    Very interesting.

    I started reading up on VHDL a while ago with a view to trying to create a COG as well. So far I can't justify the outlay for an FPGA board.

    The idea was that if one creates a COG (or indeed the whole Prop) in VHDL, then not only do you have a CPU for FPGA, but that code can also be run under GHDL (ghdl.free.fr/) on Windows or Linux, which gives you a cycle-accurate simulator for free!

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    For me, the past is not over yet.
  • nutson Posts: 242
    edited 2008-11-08 16:07
    There is a lot of interest lately in debuggers and simulators: users are taking the prop architecture to the limit. As I have done my coding in Verilog, I cannot help you with your idea.

    I have had one response to my call for help, with the request to go for the full prop instruction set. The advantage would be that the soft_CPU could potentially execute SPIN programs and drivers from OBEX. I am considering this: now that I have a basic CPU running, the goal seems less far away.

    Nico Hattink
  • heater Posts: 3,370
    edited 2008-11-08 21:21
    I have not looked into Verilog very much, but I get the impression that it is a lot less of a hill to climb than learning VHDL.

    Of course if you have some instructions working already people would like to spur you on to complete the set. And then multiple COGs and the HUB and the IO and timers and ....

    I will be watching your progress with great interest. I'm curious to find out what is the smallest, cheapest FPGA a single COG will fit into. And then the HUB etc etc

    Re: Running SPIN. The Spin byte code interpreter has been published in these forums, so it seems quite doable. Hope Parallax does not mind too much.

    I like your idea about the Wishbone interconnect. Prop with USB anyone?

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    For me, the past is not over yet.
  • nutson Posts: 242
    edited 2008-11-08 21:43
    See the attached current resource usage. This includes 512 x 32 register memory, the Propeller interface and some debug functions. Numbers I see for soft processors such as NIOS, MicroBlaze etc. suggest that 500-1000 logic cells must be possible (excluding program memory). My main problem at this stage is to learn more about Quartus: what logic it has generated, and how to minimize logic cell use and improve speed.

    Nico Hattink
  • heater Posts: 3,370
    edited 2008-11-08 22:43
    That does not seem at all bad. Tailoring this for some real application, one might not implement all the COG RAM for all the COGs, or not have all 8 COGs.

    How have you used up 283 pins?

    Perhaps you could also add the multiply instruction without much ado, shame to waste all those free multipliers.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    For me, the past is not over yet.
  • nutson Posts: 242
    edited 2008-11-09 00:21
    My top-level module is a template that defines all 2C20 pins connected to DE-1 board resources. I use only a few of them in the Verilog code (LEDs, switches, GP_IO). Quartus sees the other pins (SRAM, DRAM etc.) as unused. I can add a multiply instruction in minutes; I just checked, and the Propeller IDE accepts the instruction.

    I started with FPGAs with this module http://www.elektor.com/magazines/2006/march/versatile-fpga-module.58036.lynkx mainly because it came with a 10-part course and design examples in VHDL. I lost interest when, already in the third example, an 8051 microprocessor was used as controller with (for me) incomprehensible C and assembly programs, and no toolchain for this software. When I browsed the Terasic design examples some time later and found their Verilog code examples quite readable, I jumped onto Verilog. My feeling is that when you have done digital design at the gate and flip-flop level (7400 series), Verilog is more readable and understandable: you can sort of visualize in a timing diagram what happens with Verilog statements.
  • Cluso99 Posts: 18,069
    edited 2008-11-09 04:45
    I found a cheap intro to FPGAs: a kit from Avnet for US$39 using a Xilinx Spartan 3A XC3S400A-4FTG256C. It also comes with a Cypress CY3217 (see CY3210) MiniProg1, which is a USB dongle with an ISSP plug and an SPI interface. I would like to modify the internal code to make it prop-stick compatible, but have had no luck getting enough info to do this yet (no info on how to restore its original function if I corrupt it).

    http://www.em.avnet.com/evk/home/0,1707,RID%3D0%26CID%3D46501%26CCD%3DUSA%26SID%3D32214%26DID%3DDF2%26LID%3D32232%26PRT%3D0%26PVW%3D%26BID%3DDF2%26CTP%3DEVK,00.html

    I am finding Verilog much simpler than VHDL (I am only a beginner). Here is a good intro to Verilog http://www.asic-world.com/verilog/veritut.html

    Chip is using an Altera Stratix III for modelling the PropII. Presumably also the 64 I/O Prop I update.
  • heater Posts: 3,370
    edited 2008-11-09 08:50
    Thanks for the heads up on the Spartan board, though I'm a bit put off by its limited availability in Europe and I was keen on an Altera device.

    I have heard it said that Verilog is more popular amongst hardware types and VHDL for softies. As a softie VHDL looks good to me but the difficulty I found was that it is a very big language and that a lot of what you may naturally want to write cannot be synthesized into an actual device. Many VHDL books and online tutorials don't emphasize this much so one ends up somewhat overwhelmed and confused.

    This was rectified by the discovery of "Circuit Design With VHDL" by Volnei Pedroni which concentrates on practical circuits, has lots of examples and is clear as a bell.

    The next hurdle is VHDL's strict type checking, which is easily sorted with a quick read through www.synthworks.com/papers/vhdl_math_tricks_mapld_2003.pdf

    Another concise practical intro is ece.gmu.edu/courses/ECE545/viewgraphs_F04/loCarb_VHDL_small.pdf

    I remember looking over that Elektor series and thinking it was a bit unworkable. Just now I'm drawn to the boards used here www.fpga4fun.com/ and available here www.knjn.com/

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    For me, the past is not over yet.
  • nutson Posts: 242
    edited 2008-11-09 09:23
    I have used a KNJN Pluto with Altera 2C5, together with their Flashy 2-channel 100 MSPS converter. They work great together. Disadvantage: no SRAM on the board. I consider SRAM (256K x 16/32) essential on an FPGA board for any serious signal processing. The Elektor board is good; it has two SRAM banks. What about this company http://www.dallaslogic.com/products.htm or here http://www.jopdesign.com/ ? This thread http://forums.parallax.com/showthread.php?p=755070 could result in a PCB with prop, SRAM and CPLD (FPGA would be better).
  • heater Posts: 3,370
    edited 2008-11-09 13:02
    Dallas and Jop boards get expensive. Especially with programming cables, adapter boards etc. I'm looking to get up and running for less than 100 euro.

    I'm eagerly awaiting Leon's Prop/CPLD board.

    Does anyone happen to know where to find the Altera Cyclone serial programming protocol?

    Let's say some mad guy wanted to hang a Cyclone off of his Propeller and get the Prop to configure it from an SD card file.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    For me, the past is not over yet.
  • Leon Posts: 7,620
    edited 2008-11-09 16:09
    heater said...
    Dallas and Jop boards get expensive. Especially with programming cables, adapter boards etc. I'm looking to get up and running for less than 100 euro.

    I'm eagerly awaiting Leon's Prop/CPLD board.

    Does anyone happen to know where to find the Altera Cyclone serial programming protocol?

    Let's say some mad guy wanted to hang a Cyclone off of his Propeller and get the Prop to configure it from an SD card file.

    I ought to finish that off. I'm rather preoccupied with the XMOS chips, though.

    BTW, an XMOS chip can probably emulate Propeller cogs in software faster than the real thing, and a lot faster than an FPGA.

    Leon

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Amateur radio callsign: G1HSM
    Suzuki SV1000S motorcycle
  • Baggers Posts: 3,019
    edited 2008-11-09 16:57
    Leon, I very much doubt an XMOS could emulate a Prop cog faster than the real thing!

    100 MIPS into 20 MIPS is 5 instructions:
    1st instruction: read instruction
    2nd instruction: AND out the bits for the jump table
    3rd instruction: jump
    4th instruction: AND out the source field
    5th instruction: read source
    nope, deffo can't emulate it faster than the real thing :P
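
The budget argument above can be checked with a little arithmetic. The sketch below is only an illustration: the function name `instruction_budget` and the step list are made up here, each step is assumed to cost exactly one host instruction, and the 100 MIPS and 20 MIPS figures are the ones quoted in the post rather than measured values.

```python
def instruction_budget(host_mips: float, target_mips: float) -> float:
    """Host instructions available per emulated instruction at full speed."""
    return host_mips / target_mips


# Emulation steps listed in the post, one host instruction each (assumed).
DECODE_STEPS = [
    "read instruction",
    "mask bits for jump table",
    "jump to handler",
    "mask source field",
    "read source",
]

budget = instruction_budget(100, 20)       # figures quoted in the post
remaining = budget - len(DECODE_STEPS)     # left for execute and write-back
print(f"budget={budget:.0f}, remaining after decode={remaining:.0f}")
# prints "budget=5, remaining after decode=0"
```

Under these assumptions the whole budget is spent on fetch and decode, with nothing left for reading the destination, doing the operation or writing the result back.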

    Baggers

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    http://www.propgfx.co.uk/forum/ home of the PropGFX Lite
  • Leon Posts: 7,620
    edited 2008-11-09 17:07
    Each XMOS core actually runs at 400 MIPS. If one core isn't fast enough it could use two or more, running in parallel.

    Leon

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Amateur radio callsign: G1HSM
    Suzuki SV1000S motorcycle

    Post Edited (Leon) : 11/9/2008 5:13:18 PM GMT
  • OwenS Posts: 173
    edited 2008-11-09 17:27
    Yes, but from my understanding, the threaded pipeline means that each thread in a core executes at 100MIPS max
  • Leon Posts: 7,620
    edited 2008-11-09 17:30
    No, a single thread will run at 400 MIPS.

    Leon

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Amateur radio callsign: G1HSM
    Suzuki SV1000S motorcycle
  • Baggers Posts: 3,019
    edited 2008-11-09 17:34
    No Leon, a single thread runs at 100 MIPS; up to 4 run at 100 MIPS each, then it gets reduced per thread.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    http://www.propgfx.co.uk/forum/ home of the PropGFX Lite
  • OwenS Posts: 173
    edited 2008-11-09 17:40
    Aah; a slight misunderstanding. It's 400MIPS with one thread, which drops to 100MIPS with 2 to 4 running

    Edit: Maybe not? www.xlinkers.org/forum/viewtopic.php?f=3&t=127&p=658&hilit=mips#p660
  • Leon Posts: 7,620
    edited 2008-11-09 18:23
    You are correct, I forgot about that. I think that one can have four threads running at 100 MHz each, which makes up for it. They can be switched in one clock.

    Leon

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Amateur radio callsign: G1HSM
    Suzuki SV1000S motorcycle
  • nutson Posts: 242
    edited 2008-11-09 18:26
    I have read the same as Baggers: each thread runs at 100 MIPS with up to 4 threads, going down to 50 MIPS/thread with 8 threads active. Still respectable. And no deterministic timing without external synchronization.
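
The figures quoted in this exchange fit a simple model in which a core issues at most 400 MIPS in total and no single thread gets more than 100 MIPS. A rough sketch of that model follows; `CORE_MIPS`, `THREAD_CAP_MIPS`, `MAX_THREADS` and `mips_per_thread` are names invented here, and the even split between 4 and 8 threads is an assumption rather than something taken from XMOS documentation.

```python
CORE_MIPS = 400.0         # total per core, as quoted earlier in the thread
THREAD_CAP_MIPS = 100.0   # per-thread ceiling, as quoted in the thread
MAX_THREADS = 8           # largest thread count mentioned in the thread


def mips_per_thread(active_threads: int) -> float:
    """Per-thread rate: an even share of the core, capped at 100 MIPS."""
    if not 1 <= active_threads <= MAX_THREADS:
        raise ValueError("active_threads must be between 1 and 8")
    return min(THREAD_CAP_MIPS, CORE_MIPS / active_threads)


for n in (1, 4, 8):
    print(n, "threads:", mips_per_thread(n), "MIPS each")
```

With these numbers one to four threads each get 100 MIPS and eight threads get 50 MIPS each, which matches the values given in the posts.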
  • Leon Posts: 7,620
    edited 2008-11-09 18:41
    Anyway, it should be possible to emulate a cog on each core, using four threads.

    Leon

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Amateur radio callsign: G1HSM
    Suzuki SV1000S motorcycle
  • Baggers Posts: 3,019
    edited 2008-11-09 18:51
    yup :) they are quick I'll give you that, but IMHO PropII will have it beat hands down :D

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    http://www.propgfx.co.uk/forum/ home of the PropGFX Lite
  • Baggers Posts: 3,019
    edited 2008-11-09 18:54
    Leon, I personally don't think you'll be able to emulate a single cog even on 4 threads faster than the real thing :D
    Sharing data back n forth will take too long, but if you still feel that strongly about it, go for it :D and prove me wrong, I'd gladly eat humble pie.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    http://www.propgfx.co.uk/forum/ home of the PropGFX Lite
  • heater Posts: 3,370
    edited 2008-11-09 19:15
    It's a max of 100 MIPS per thread, so if we can emulate one prop instruction in 5 XMOS instructions we're in. I seriously don't think that is going to happen, so we'll never get to Prop speed.

    Next killer is the RAM. If you want to emulate the COG memory for multiple COGs and the HUB RAM, well it's just going to get stuck.

    I can just see it now, running my 8080 emulator in an emulated Prop on an XS1 !

    We had better quit this talk of that "other" company before we get our wrists slapped again :)

    Surely the FPGA implementation could get up to Prop speed. At least for a single COG.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    For me, the past is not over yet.
  • Ariba Posts: 2,690
    edited 2008-11-09 19:19
    I second the doubts of Baggers and heater.
    And why should an XMOS be faster than an FPGA? With an FPGA you can do in parallel everything that has to be decoded sequentially on a CPU.

    There are a lot of 32-bit CPU designs for FPGAs, and they run at up to 100 MIPS and more.

    Andy