CMM in hardware?

David Betz · 2014-08-08 14:11

I know it's crazy but I've been wondering how practical it would be to implement the PropGCC CMM engine in hardware. Mostly, it unpacks a byte stream into PASM instructions that it executes one at a time. The idea would be to fetch a long at a time from hub memory and step through bytes decoding CMM instructions and building a PASM instruction which would be executed as soon as it was complete. I've attached a document describing the PropGCC CMM instruction set. Tell me if I'm insane! :-)

CMM.txt

By the way, the advantage of this approach is that there is already a PropGCC code generator for this instruction set.
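
To make the fetch-and-unpack idea concrete, here is a rough software model of it in Python. It is only a sketch under assumptions: the byte order within a long (lowest byte first) and the `instruction_length` rule are made up for illustration, apart from the two 3-byte forms (`$4y` brw and `$06` lcall) that appear in the CMM excerpt later in this thread. The function names are hypothetical and do not come from PropGCC.

```python
def bytes_of_long(value):
    """Split one 32-bit hub long into 4 bytes, lowest byte first (assumed order)."""
    return [(value >> (8 * i)) & 0xFF for i in range(4)]


def instruction_length(first_byte):
    """Hypothetical length rule, for illustration only.

    From the posted excerpt: $4y (brw) and $06 (lcall) each carry 2 more bytes.
    Every other opcode is ASSUMED to be a single byte here, which is not true
    of the real CMM set.
    """
    if first_byte == 0x06 or (first_byte & 0xF0) == 0x40:
        return 3
    return 1


def split_instructions(hub_longs):
    """Walk hub memory a long at a time and group bytes into instructions."""
    pending = []    # bytes of the instruction currently being assembled
    needed = 0      # total bytes that instruction needs
    complete = []   # finished instructions, ready to be "executed"
    for value in hub_longs:              # one hub fetch per long
        for b in bytes_of_long(value):   # step through its four bytes
            if not pending:
                needed = instruction_length(b)
            pending.append(b)
            if len(pending) == needed:
                complete.append(pending)
                pending = []
    return complete
```

Note that the `pending` state has to survive from one long to the next, because an instruction can straddle a long boundary. A hardware version would need the same kind of holding register.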

Cluso99 · 2014-08-08 14:43

David,
You're insane

As indeed most of us are here

Now, let me go look at what you posted.

David Betz · 2014-08-08 14:48

Cluso99: Thanks for the encouragement! :-)

Cluso99 · 2014-08-08 14:58

I think that's doable. Be fun trying anyway.

David Betz · 2014-08-08 14:59

Cluso99 wrote: »

I think that's doable. Be fun trying anyway.

I'm going to at least make a stab at it although I have very little Verilog experience. Just one class in college.

Cluso99 · 2014-08-08 15:05

That's one class more than most of us, me thinks.

overclocked · 2014-08-08 15:09

Is it fair to say that this is just a separate instruction set, so actually it has nothing to do with the original Propeller?
I second that, a good hack. Probably no problem at all adding it.
Just make sure to implement simulation so you can try your core within the development environment. It will save you time and be good to have a test-platform later.

Good luck!

David Betz · 2014-08-08 15:11

Cluso99 wrote: »

That's one class more than most of us, me thinks.

I have a number of FPGA boards that I bought with the idea of trying to run the Verilog code I wrote for a simple processor. I never got around to it of course. The DE0-Nano I'm using for the P1 source is one of those boards. I had it before the P2 image became available for it. Chip releasing that image and this P1 source code gave me an opportunity to dig it out and actually use it. Maybe I should try my CPU code as well! :-)

jazzed · 2014-08-08 16:16

overclocked wrote: »

Is it fair to say that this is just a separate instruction set, so actually it has nothing to do with the original Propeller?
I second that, a good hack. Probably no problem at all adding it.

The CMM instruction set is closely related to Propeller PASM, but the code is up to 65% smaller than comparable LMM*. It's like a Thumb mode, except that most of the instructions get decoded into PASM by a virtual machine running in the COG, which is relatively slow.

The idea seems to be to get rid of the PASM virtual machine by changing the COG (as an experiment for a CMM-native propeller) so that all the time-consuming decoding gets done in the COG core hardware (for the set of instructions that are used most often). I don't think that any optimizations for HUB window stalls have been considered, and they probably don't need to be in a first cut at least.

Advantages:

1. Higher code density over LMM*
2. Speed on par with LMM* for non-fcached code, second only to COG PASM speed
3. CMM allows block read/execute of up to 64 PASM longs at COG speed
4. Getting rid of the VM saves global HUB RAM space.
5. A proven compiler already exists that generates the CMM code

Disadvantages:

1. Bigger COGs?
2. Development time.

*Notes: LMM was suggested by Bill Henning as a way to read and execute PASM code from HUB RAM one long at a time. Fcache is a larger fill and execute mechanism Bill mentioned as a higher performance overlay method. The only problem with LMM is the macros needed to do certain things and the fetch window stall problem. Macros used for LMM have been considered for a HUB-EXEC mechanism, and auto increment index variants of RDLONG have been considered to fix the fetch window stall problem.
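
The fetch window stall can be illustrated with a very simplified timing model. The numbers are assumptions, not measurements: it takes a cog's hub slot to come around every 16 clocks and treats the fetch loop as costing a fixed number of clocks, so any loop that does not fit one window has to wait for the next.

```python
HUB_WINDOW_CLOCKS = 16  # ASSUMED spacing between one cog's hub access slots


def clocks_per_fetch(loop_clocks, window=HUB_WINDOW_CLOCKS):
    """Round a fetch loop's clock count up to the next hub window."""
    return -(-loop_clocks // window) * window


def wasted_clocks(loop_clocks, window=HUB_WINDOW_CLOCKS):
    """Clocks spent idle waiting for the hub slot on each pass."""
    return clocks_per_fetch(loop_clocks, window) - loop_clocks


for loop in (16, 20):
    print(loop, clocks_per_fetch(loop), wasted_clocks(loop))
```

In this model a 16-clock loop loses nothing, while a 20-clock loop is stretched to 32 clocks with 12 of them idle. That is the kind of loss an auto-incrementing RDLONG or a hardware fetcher would aim to avoid.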

Bob Lawrence (VE1RLL) · 2014-08-08 17:14

re: I have very little Verilog experience. Just one class in college.

That's two more than me

jmg · 2014-08-08 21:40

David Betz wrote: »

.. Tell me if I'm insane! :-)

By the way, the advantage of this approach is that there is already a PropGCC code generator for this instruction set.

Not insane, the idea has strong merit.
The PropGCC code generator is a serious plus.

It probably needs a pause/enable signal added to the Prop opcode fetch, so it can idle clocks until the CMM reader is ready. ( I think such a signal is already needed for the WAIT opcodes. )

When this works from HUB, it should be possible to expand it to stream from a QuadSPI device

David Betz · 2014-08-09 04:33

jmg wrote: »

Not insane, the idea has strong merit.
The PropGCC code generator is a serious plus.

It probably needs a pause/enable signal added to the Prop opcode fetch, so it can idle clocks until the CMM reader is ready. ( I think such a signal is already needed for the WAIT opcodes. )

When this works from HUB, it should be possible to expand it to stream from a QuadSPI device

Anything is possible. We have the source code. We have the technology!
I guess the problem is that the resulting COG might be more than double the size of the current COG, so you'd need a much bigger FPGA to get a P1+CMM with 8 COGs.

David Betz · 2014-08-10 08:04

I just remembered something about the CMM instruction set that might make this project less useful.

$4y   brw: y is the condition for the branch, next 2 bytes are address

Macro Instructions
$06   lcall <word>: call subroutine
      mov __TMP0, #<word>
      call #_LMM_CALL_INDIRECT

CMM uses 16-bit code addresses, since on the real P1 there isn't really any need to address more than 32K of memory. This was going to be a problem when porting CMM to P2 as well. It will require at least minor changes to the CMM code generator in PropGCC to allow it to address the larger hub memory people have been talking about. We could, of course, handle 64K perfectly well, but anything more than that will require us to extend the immediate addresses to at least three bytes and hence will require tools changes.
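
The address-width limit is easy to check with a few lines of Python. This is a sketch only: `decode_brw` follows the `$4y` description above, but the byte order of the two address bytes is an assumption (low byte first), and both function names are hypothetical.

```python
def addressable_bytes(address_width_bytes):
    """Number of distinct byte addresses an immediate of this width can name."""
    return 1 << (8 * address_width_bytes)


def decode_brw(code, pc):
    """Decode a $4y brw at code[pc]; returns (condition, target, next_pc).

    ASSUMPTION: the two address bytes are stored low byte first.
    """
    opcode = code[pc]
    if (opcode & 0xF0) != 0x40:
        raise ValueError("not a brw opcode")
    condition = opcode & 0x0F                     # the 'y' nibble
    target = code[pc + 1] | (code[pc + 2] << 8)   # 16-bit branch address
    return condition, target, pc + 3


print(addressable_bytes(2))  # 65536
print(addressable_bytes(3))  # 16777216
```

Two address bytes reach exactly 64K, and a third byte raises the limit to 16M, which is why anything past 64K means changing the instruction encoding and the tools that emit it.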
