FPGA based soft-CPU (distant relative of COG)

nutson · 2009-02-09 11:49

Heater, correct. I think it is feasible to speed up the LMM kernel in an FPGA-based COG without losing the ability to execute PASM code:
- The FPGA CPU can be faster (50+ MIPS).
- The three-instruction LMM cycle (fetch from hub, increment PC, execute) can be reduced by creating a single fetch-and-execute instruction that executes in 4-5 clocks, depending on memory configuration.
- Several of the LMM FCALL functions could be sped up by creating "hardwired" instructions for them (PUSH, POP, etc.).

I would still need to implement conditional execution and correct C and Z handling, but for far fewer instructions. I would also need byte/word/long addressing of FPGA or board memory.

Nico Hattink

Cluso99 · 2009-02-09 12:31

Nico, I have done all the C and Z handling (checking before execution), but not yet the setting of flags after instruction execution. In fact, it is very simple due to the regular instruction set. And because the condition can be decoded by the 2nd clock at the latest, it is possible to short-circuit the instruction and go fetch the next one (meaning 2 cycles if the condition fails, i.e. a nop).
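A minimal sketch of that pre-execution check in Python. It assumes the Propeller 1 instruction layout as documented in the Propeller manual: a 4-bit condition field in bits 21..18 that acts as a truth table indexed by the C and Z flags. The function names and the 4-clock figure for a normally executed instruction are illustrative assumptions, not taken from Cluso99's design; only the 2-cycle skip comes from the post.

```python
# Assumed Propeller 1 layout: bits 21..18 hold the 4-bit condition field.
COND_SHIFT = 18
COND_MASK = 0xF


def condition_passes(instr: int, c: int, z: int) -> bool:
    """Return True if the instruction's condition field allows execution.

    The field is treated as a truth table: bit (C*2 + Z) says whether the
    instruction runs for that combination of flags.
    """
    cond = (instr >> COND_SHIFT) & COND_MASK
    return bool((cond >> ((c << 1) | z)) & 1)


def cycles_for(instr: int, c: int, z: int,
               full_cycles: int = 4, skipped_cycles: int = 2) -> int:
    """Clock count for one instruction with the short-circuit described above.

    full_cycles is an assumed value; skipped_cycles=2 follows the post.
    """
    return full_cycles if condition_passes(instr, c, z) else skipped_cycles


# Example condition fields (opcode and operand bits left at zero).
IF_ALWAYS = 0b1111 << COND_SHIFT
IF_NEVER = 0b0000 << COND_SHIFT
IF_Z = 0b1010 << COND_SHIFT
IF_C = 0b1100 << COND_SHIFT
```

Because the check only needs the condition bits and the two flags, it can be resolved before the operands are used, which is what makes the early skip possible.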

Re StratixIII, I was thinking Parallax could make a minimal pcb with the StratixIII fpga on it for a lot less than the demo boards of that level. It seems the cheapest Stratix III is about $500 ea. Ouch!!


nutson · 2009-02-09 12:34

The better answer is: Having a COG execute LMM code very fast is feasible, by creating specialised (macro) instructions, but the COG would remain in control.

The next step, a hardwired engine that takes the ImageCraft compiler output as its instruction set and executes it directly, is more complicated, especially because that output is a mix of PASM instructions and FCALL functions, which would have to be executed as multiword instructions. I have no ideas in that direction at the moment.

My problem is not to have the fastest possible C machine; my problem is: what is the best solution for executing high-level language programs on my embedded CPU? SPIN? It looks complicated to me, as I would have to implement a fairly complete COG model to be able to run the SPIN interpreter. Having my COG execute C-LMM compiler code looks more feasible at the moment. But there may be other options: I looked at Java bytecode, and maybe Hippy has some compiler that outputs easy-to-interpret intermediate code in his large box filled with magical tricks.

heater · 2009-02-09 14:15

I think the idea of implementing an FPGA COG is great. And if you could add some method of turbo charging the LMM loop that would be fantastic.

I guess to do that you would have to adapt some currently unused COG opcodes to implement the LMM loop and/or kernel functions, FCALL etc. At that point you run into possible conflicts with whatever Chip is doing for Prop II. But here and now that does not matter. It just means you may have to change it later to remain compatible.

If you boil the problem down to "I want a CPU in my FPGA that runs a compiled high level language" then you would not be here. You would already have used one of the many freely available cores (MIPS, x86, LEON, whatever) that work with GCC, so that C, C++, Pascal and Fortran are available.

However if the problem is now "I want to use MY cog-like core to run a high level language" then we are in the world of ImageCraft C and LMM, or SPIN. No idea how much of a COG implementation the Java engine might need.

As an outside option one could implement a CPU core that executes SPIN bytecodes directly. I guess you don't want to go there though.


rjo_ · 2009-02-09 15:18

nutson,

Just PM it to me... Parallax has my permission to monitor my PM's and that way they can get a live feed of your work.

I have a DE1... and two of the camera modules :)

Very interesting stuff you are doing... might be more of an audience than you can imagine :)

Rich

rjo_ · 2009-02-09 15:21

My goal was simply to be able to trigger the cameras and download the data to a shared SD card... I'm miles away. You are right there!!!

hippy · 2009-02-09 18:40

nutson said...
I think I can make a pretty fast LMM execution engine.

Isn't that really just the same as a normal Cog, but pulling its opcodes from somewhere other than Cog RAM, with some extra opcode handling?

In which case it looks entirely feasible. Not needing the written-in-PASM fetch-execute loop gives an instant factor-of-four increase in speed.

In fact you could implement the functionality as a new PASM opcode 'exlong reg' ...

1) Fetch (Hub) memory long pointed to by 'reg', increment 'reg'
2) Execute loaded long as a PASM opcode ( enhanced for LMM use as needs be )
3) Keep PC the same unless there was an 'unhandled' exception or 'lmmstop' opcode when it continues at PC+1
4) Repeat
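
The four steps above can be sketched as a small Python model. Everything here is hypothetical: the `LMMSTOP` encoding, the `execute` callback that stands in for the Cog's normal instruction execution, and the use of a long index for 'reg' (real hub addresses are byte addresses, so hardware would step by 4). Jumps that rewrite 'reg' are not modelled.

```python
LMMSTOP = 0xFFFFFFFF  # hypothetical encoding for the 'lmmstop' opcode


def exlong(hub, reg, execute):
    """Model of the proposed 'exlong reg' instruction.

    hub     -- list of 32-bit ints standing in for hub memory (one per long)
    reg     -- index of the first long to execute
    execute -- callback taking an opcode; returns False if it is 'unhandled'

    Returns the value of reg when the instruction finally completes, which is
    the point where the Cog would continue at PC+1.
    """
    while True:
        opcode = hub[reg]        # step 1: fetch the long pointed to by reg
        reg += 1                 #         and increment reg
        if opcode == LMMSTOP:    # step 3: 'lmmstop' ends the instruction
            return reg
        if not execute(opcode):  # step 2: execute; 'unhandled' also ends it
            return reg
        # step 4: repeat, with the Cog's own PC left unchanged


# Usage: record which opcodes were executed before the stop marker.
executed = []
end = exlong([1, 2, 3, LMMSTOP, 9], 0, lambda op: executed.append(op) or True)
```

In this model the fetch-execute loop costs no Cog instructions at all, which is where the speed-up over a PASM-coded LMM loop would come from.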

You could also add an 'exword reg', which does the same but takes a 16-bit PASM-thumb opcode, expands that to a 32-bit PASM instruction and executes it. That doubles your code capacity in theory.

Both are what I'd like to see added to the Propeller anyway.

Not sure you get much gain though with a single Cog as it's no better than any other single-core micro which runs as fast and has a C compiler for it. A single Cog Propeller compares pretty poorly to most other micros on a number of counts, but its multi-Cog nature gives a real advantage over them.

nutson · 2009-02-09 18:42

rjo_, did you check your PM? I have the design files ready for you, 1MB zipped, but I can't PM that to you.

Nico

rjo_ · 2009-02-09 19:11

nope...sorry... will reply now.

nutson · 2009-02-09 20:32

@Hippy: Yes, these things are feasible when one controls the hardware architecture. I see three problems:

- There are no free opcodes afaik (4 reserved), only some space left in the 000011 group. That's one reason I am hesitant to go for a complete COG implementation: I want to (mis)use existing opcodes for new functions such as multiply/accumulate.

- Affordable FPGAs have plenty of logic for multiple COGs, but are limited by the available on-chip dual-port RAMs. The Cyclone2C25 on the DE-1 board has 52 blocks of 512 x 8; I need 4 per COG, so I could implement 12 COGs (and have nothing left). I am thinking of doing a test with 3. The CycloneIII 3C25 board has 66 blocks of 1K x 8. I could do 8 COGs with 32K of shared memory on that.

- The main problem is that with multiple COGs, shared access to hub memory will limit the bandwidth per COG, especially if timing determinism must be kept. The only way out is to make hub memory wider, which is what Chip is doing in Prop II: multiple-long access in one hub cycle.
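
The bandwidth point can be made concrete with a small calculation. This is a sketch assuming a fixed round-robin hub schedule in which each COG gets one access slot per rotation; the function name and parameters are illustrative. The example figures (80 MHz, 8 COGs, a hub window every 16 clocks, one long per access) are the usual Propeller 1 numbers and say nothing about what an FPGA design would achieve.

```python
def hub_bandwidth_per_cog(clock_hz, n_cogs, clocks_per_slot, bytes_per_access):
    """Bytes per second each COG gets under a fixed round-robin hub schedule.

    Each COG is served once per rotation of n_cogs * clocks_per_slot clocks,
    which keeps timing deterministic but divides the bandwidth by n_cogs.
    """
    rotation_clocks = n_cogs * clocks_per_slot
    return clock_hz * bytes_per_access / rotation_clocks


# Propeller-1-like figures: one long (4 bytes) every 16 clocks at 80 MHz.
one_long = hub_bandwidth_per_cog(80_000_000, 8, 2, 4)

# Same schedule, but a wider hub that moves four longs per slot.
four_longs = hub_bandwidth_per_cog(80_000_000, 8, 2, 16)

# Adding COGs without widening the hub lowers the per-COG share.
twelve_cogs = hub_bandwidth_per_cog(80_000_000, 12, 2, 4)
```

With the schedule fixed, the only terms that raise the per-COG share are the clock rate and the access width, which is why a wider hub is the way out.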

Nico Hattink

Ale · 2009-02-09 21:00

Just use the reserved ones, like MUL for a multiply. It is your design after all, and a custom compiler (like BST) can be used if necessary; I'm sure Brad can accommodate a few instructions without too much trouble. mpark's homespun compiler is also available, and he may add those opcodes if needed. Well, I have no idea whether they would do it, but I am guessing from their past approachability.
