Demands on Chip after August + and some possible design goals

David Betz · 2014-08-12 15:42

Cluso99 wrote: »
David,
Here's my understanding...

In cog.v there are 5 states, m0-m4. The normal cycle is m0..m3, and these are the normal states we refer to in the P1...
For the current instruction
m0: Fetch instruction S contents
m1: Fetch instruction D contents
m2: Execute instruction & Fetch next instruction
m3: Writeback Result
m0: .....etc
Now,
m4: Is a wait caused by the execution of a waitx instruction (eg waitcnt, waitpeq, etc)

Don't forget these are overlapped stages for the 2 instructions that overlap that we used to refer to as
I d S D e R
        I d S D e R
Hopefully you can now follow what happens in each state as coded in the // process and following sections.

Does this help?

Thanks! Yes, that helps. But what does "d" mean, decode instruction?

Cluso99 · 2014-08-12 15:49

d=decode instruction and e=execute instruction. R=write result (optional depending on r bit in the instruction).
I=fetch instruction, S=fetch the source, D=fetch the destination.
Note S may not be required - ie it is immediate mode (i bit)
Note2 When INA (or other special registers) are the Source, they are supplied in the execute stage IIRC.

Quite a lot of this was discussed a couple of years back on the prop forum. kuroneko was the master here.

Heater. · 2014-08-12 16:13

I haven't really understood the question posed in this thread but it seems to me that:

1) Somebody at Parallax has to "own" the P1 Open Source project. Especially if there is any likelihood that a future enhanced version of it might make it into silicon at Parallax's expense. As far as I can tell that someone can only be Chip. Do correct me if I'm wrong.

2) There is a lot of work that can be done on the current design whilst keeping it totally true to the Propeller as it exists in silicon today. Adding support for different FPGA boards. Making it usable on Xilinx devices. Possible VHDL version(s). General bug fixes and code clean up. And so on.

3) With that out of the way I might guess that the direction of any future P1++ development will be determined by the "community". I hope Chip is busy building the P2.
The "community" will realistically be about three guys (if that) who have the skill and vision and tenacity to create something useful and make it actually work. The rest of us will be full of good suggestions though:)

4) Whilst there may be a lot of community developments going on with various people taking the P1 design in directions they personally like. I don't think Parallax should concern itself with hosting or supporting all of that. Parallax as the "owner" of the project should adopt and foster what they feel may realistically end up in a future chip. That 64 bit Propeller I have in my plans may not be viable for Parallax realistically:)

jmg · 2014-08-12 16:23

Heater. wrote: »

... That 64 bit Propeller I have in my plans may not be viable for Parallax realistically:)

With Verilog Source, a first pass 64 bit Propeller is a few lines edited...

There is another stepping stone I think worth considering :

On most FPGAs a 36b version comes almost for free. (no nett RAM cost)
It also provides a way to add more opcodes, but keep 100% binary compatibility with existing P1 code.

Heater. · 2014-08-12 16:30

36 bits you say...

Ok that's 4 more than 32 bits.

That means that each src and dst field can be 11 bits instead of 9.

That means we can address 2048 registers instead of just 512.

That sounds totally great to me already. I don't need any more opcodes. And for free you said?

No idea what we do with the HUB though. Keep it as 32 bit wide store?
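
For reference, the field arithmetic behind this can be sketched in Python. The 32-bit layout in the comment (6-bit opcode, 4 ZCRI bits, 4 condition bits, 9-bit destination, 9-bit source) is the existing P1 instruction format; the helper names `decode_p1` and `addressable_registers` are made up for illustration only.

```python
# Existing 32-bit P1 instruction layout:
#   [31:26] opcode  [25:22] ZCRI flags  [21:18] condition  [17:9] dest  [8:0] source

def decode_p1(word):
    """Split a 32-bit P1 instruction into its fields (illustrative helper)."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "zcri": (word >> 22) & 0xF,
        "cond": (word >> 18) & 0xF,
        "dest": (word >> 9) & 0x1FF,
        "src": word & 0x1FF,
    }

def addressable_registers(field_bits):
    """Number of cog registers reachable with an address field of this width."""
    return 1 << field_bits
```

With 9-bit fields `addressable_registers(9)` gives 512, and with the 11-bit fields suggested above `addressable_registers(11)` gives 2048.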

Yanomani · 2014-08-12 16:41

jmg

With those two extra bits, you can also get four-task behaviour, selecting which program counter, flag set and register bank you're using.
It's enough to latch the incoming values during the fetch cycle, taking care not to disturb any pending write from a previous cycle (I'm not sure if this happens at all).
Or split them, two hardware tasks, two instruction sets.
It's up to you to decide and play fun!

Yanomani
P.S. Bad eyes here, four bits, i.e. four hardware tasks, four instruction sets, simultaneously....

jmg wrote: »

With Verilog Source, a first pass 64 bit Propeller is a few lines edited...

There is another stepping stone I think worth considering :

On most FPGAs a 36b version comes almost for free. (no nett RAM cost)
It also provides a way to add more opcodes, but keep 100% binary compatibility with existing P1 code.

Bill Henning · 2014-08-12 16:48

Hub could go 36 bit as well.

Heck we may as well define:

BYTE = 9 bits
WORD = 18 bits
LONG = 36 bits

Shades of DEC's past!

Heater. wrote: »

36 bits you say...

Ok that's 4 more than 32 bits.

That means that each src and dst field can be 11 bits instead of 9.

That means we can address 2048 registers instead of just 512.

That sounds totally great to me already. I don't need any more opcodes. And for free you said?

No idea what we do with the HUB though. Keep it as 32 bit wide store?

Cluso99 · 2014-08-12 16:54

1. P1.5 has to be backward compatible.
2. There are easier ways to increase the cog ram and keep compatibility (I will be trying this out very soooon now).

I guess we should get an idea from Parallax what geometry node the P1.5 would be intended for?
What is the current P1 fabbed in - 360nm???

Yanomani · 2014-08-12 16:59

Heater

One of those bits is just crying loud to solve the immediate 32 bits constant load dilemma!

Yanomani

Heater. wrote: »

36 bits you say...

Ok that's 4 more than 32 bits.

That means that each src and dst field can be 11 bits instead of 9.

That means we can address 2048 registers instead of just 512.

That sounds totally great to me already. I don't need any more opcodes. And for free you said?

No idea what we do with the HUB though. Keep it as 32 bit wide store?

jmg · 2014-08-12 17:07

Cluso99 wrote: »

1. P1.5 has to be backward compatible.

It depends a little what is meant by "backward compatible".
To most, it means able to run P1 code, and binary compatible means able to load a P1 Binary file.
However, some might take "backward compatible" to mean all P1V code has to also run on P1, which basically means no extensions at all.

I would prefer words like 'Superset Compatible' as it makes it clear that old code can run on new chips, but new code cannot run on old chips.
Something like mapping the 4 extra bits to Address extension would be 100% backward compatible, and easy to implement even conditionally in source code.

It just needs a 32->36 binary loader for existing binaries, and the tools need to create 36b for new (ASM only initially?) code.

Addit : to add another variable into the mix, I see the Cyclone V parts bump from the usual/common 9b FPGA memory, to 10b, so they have M10K = 512 x 40b Memory. Maybe that loader needs 32 -> 36 or 40 choice

On a Cyclone V, you can get EIGHT more bits ..

Heater. · 2014-08-12 17:18

Yanomani,

One of those bits is just crying loud to solve the immediate 32 bits constant load dilemma!

I have never had a "32 bits constant load dilemma"

But as I have just used all the new free bits to extend the COG to 2048 registers there is plenty of room to keep constants in and reference them with the 11 bit fields.

There is even less immediate constant load dilemma:)

Yanomani · 2014-08-12 18:11

Heater

My mistake!
I was trying to get a ride in your comments about the number of free bits and messed it up!
Sorry!

Yanomani

Heater. wrote: »

Yanomani,

I have never had a "32 bits constant load dilemma"

But as I have just used all the new free bits to extend the COG to 2048 registers there is plenty of room to keep constants in and reference them with the 11 bit fields.

There is even less immediate constant load dilemma:)

John A. Zoidberg · 2014-08-12 18:52

Like the Nios in the Altera FPGAs, the Prop1 could benefit from optionally connecting the rest of the memory to SRAM/SDRAM. We can dump all the math tables and the character ROM inside that same place. And ditto for the program space and whatever is inside it. The only problem I can think of is that I'm not sure whether multiple cogs accessing the same location of RAM is permissible in an FPGA emulation of the P1, since these external memories are not multi-ported.

The RAMs inside the FPGA are normally multi-ported, so it's very, very easy to just dump the P1 core inside the FPGA. However, many FPGAs sold are quite small, therefore an external memory is needed.

It may sound very silly but I'm pondering about this possibility since I've heard about Nios (or Microblaze) inside the FPGAs.

David Betz · 2014-08-12 19:50

jmg wrote: »

With Verilog Source, a first pass 64 bit Propeller is a few lines edited...

There is another stepping stone I think worth considering :

On most FPGAs a 36b version comes almost for free. (no nett RAM cost)
It also provides a way to add more opcodes, but keep 100% binary compatibility with existing P1 code.

How do you expand the word size to 36 bits and still maintain 100% binary compatibility?

David Betz · 2014-08-12 19:55

Heater. wrote: »

The "community" will realistically be about three guys (if that) who have the skill and vision and tenacity to create something useful and make it actually work.

You might be right but I'd like to think that I could make a contribution at some level even if it is something simple. I imagine others feel the same. However, it's good to know there are some really smart people here who already know how to do this stuff and don't have to climb the relatively steep learning curve.

jmg · 2014-08-12 20:55

David Betz wrote: »

How do you expand the word size to 36 bits and still maintain 100% binary compatibility?

The upper 4 bits are default zeroed, and a simple downloader packs any 32 bit binaries with 4b Zeros.

Any tools that support the extended options, set the upper 4 bits as needed.
(ie they create 36b fields.)
Of course, if the files are true binary, some means to identify the two formats is needed.

Result is a binary compatible super set. All existing code has no idea those 4 bits exist.
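
A minimal Python sketch of that idea follows. The zero-padding loader step is as described above; the assignment of the four extra bits (two extending the destination field, two extending the source field) is only one hypothetical layout, borrowed from the 11-bit-field suggestion earlier in the thread, and the names `widen` and `decode36` are invented for this example.

```python
MASK32 = 0xFFFFFFFF

def widen(word32):
    """Loader step: a 32-bit long becomes a 36-bit long with bits [35:32] zero."""
    return word32 & MASK32

def decode36(word36):
    """Hypothetical 36-bit decode (assumption, not a defined format):
    bits [35:34] extend the dest field, bits [33:32] extend the source field."""
    ext = (word36 >> 32) & 0xF
    dest = (((ext >> 2) & 0x3) << 9) | ((word36 >> 9) & 0x1FF)
    src = ((ext & 0x3) << 9) | (word36 & 0x1FF)
    return dest, src

# A legacy instruction word with dest = 0x1F0 and source = 0x005.
legacy = (0x1F0 << 9) | 0x005

# Padded with zeros, it decodes to exactly the same 9-bit addresses.
legacy_fields = decode36(widen(legacy))

# A word produced by a 36-bit-aware tool can reach registers above 511.
extended_fields = decode36((0b0110 << 32) | legacy)
```

Here `legacy_fields` is `(0x1F0, 0x005)`, the same addresses the 32-bit code used, while `extended_fields` is `(0x3F0, 0x405)`. Old code never sets the upper bits, so it cannot tell they exist.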

__red__ · 2014-08-12 21:02

Heater. wrote: »

1) Somebody at Parallax has to "own" the P1 Open Source project. Especially if there is any likelihood that a future enhanced version of it might make it into silicon at Parallax's expense. As far as I can tell that someone can only be Chip. Do correct me if I'm wrong.

I understand exactly where you're coming from Heater but managing OSS is a full-time job. The BDFL model may fit well here.

BDFL, or "Benevolent Dictator For Life", is an honorific given to project founders or leaders in large OSS projects. Examples include Linus, Guido and Wall.

Picking on the two projects I know most about, they operate similarly but with a slightly different focus. It all depends on how deep Chip wants to be in the day-to-day.

Firstly, the Linux kernel.

The linux kernel has multiple branches and areas of the kernel have maintainers. When you write a patch, you don't submit it to Linus, you submit it to the maintainer and then it eventually filters up the tree to Linus who not only has final say but also does the branch merging and packaging himself. This is Linus' full time job. Linus has delegated ownership of legacy branches to people he trusts but doesn't really care about them as they pretty much only get security patches.

The Perl programming language has Larry Wall as their BDFL and the BDFL delegates day-to-day responsibility to a community member called a "pumpking" (or pumpkin, story is in perlhist). That "pumpking" is responsible for the day-to-day patch integration and such freeing up Larry to make strategic-only decisions. If Chip is going to focus on P2, this may be the better model. The "pumpking" role rotates over time through a small group of people - you can see 20 years of pumpking history here:

http://perldoc.perl.org/perlhist.html

You'll also notice that there are different pumpkings for maintenance and dev.

Red

Heater. · 2014-08-12 22:16

David,

...but I'd like to think that I could make a contribution...

You may well be one of those three guys I mentioned...

__red__

Yes, I'm sure managing an OSS project can be a huge amount of work. The Linux kernel was averaging 200 changes per day in 2013! No single human being could handle that sensibly. Already we see the reset bug fix for the P1 design that was discussed here for a day or two and took up I don't know how much of Chip's time to think about. And that's only a couple of lines of patch.

At some point divide and conquer is required with significant areas of work identified and owned by trusted "pumpkins" or "lieutenants" (Hey, we could call such subsystem owners "cogs" in the P1 developer organization).

So I imagine, for example, someone will step up and manage a Xilinx port. All Xilinx changes will go through them. The trick is that whatever they put in there must not break what already works for the Altera. Chip's role at that point is to try and be sure the Altera build does not break and merge the Xilinx stuff. It should not matter to him if the Xilinx build never works. Not his problem.

I'm guessing that however this is sliced and diced, who ends up playing what role will have to emerge as we go along.

Ale · 2014-08-13 07:00

I thought there were 4 free opcode slots. MUL (16x16, not because 32x32 is not doable but because we would need to give a 64 bit result), MULSigned and of course ADDBCD and SUBBCD ;-) Fused mul-add would be great, that is free

David Betz · 2014-08-13 07:06

Ale wrote: »

I thought there were 4 free opcode slots. MUL (16x16, not because 32x32 is not doable but because we would need to give a 64 bit result), MULSigned and of course ADDBCD and SUBBCD ;-) Fused mul-add would be great, that is free

One problem with doing only 16x16 multiply is that it wouldn't really be useful for C and maybe not for Spin either. Unless you can make a 32x32 multiply out of a bunch of 16x16 multiplies.

Bill Henning · 2014-08-13 07:13

You can. Four 16x16 multiplies with appropriate adds is the same as a 32x32 multiply.

For unsigned 32x32: (AB = 32 bit word one, CD = 32 bit word two, A/B/C/D are the 16 bit halves of the longs)

AB
CD

(D*B) + ((D*A)<<16) + ((C*B)<<16) + ((C*A)<<32) gives 64 bit result

For signed 32x32:

- preserve the signs of AB and CD
- take the absolute value of both
- do unsigned multiplication as above
- if the sign of AB is different than sign of CD, result = 0 - result
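
The recipe above can be checked with a short Python sketch. `mul16` here is only a stand-in for the proposed 16x16 instruction (no such opcode exists in the P1), and the function names are invented for this example.

```python
MASK16 = 0xFFFF

def mul16(x, y):
    """Stand-in for a hypothetical unsigned 16x16 -> 32 bit multiply instruction."""
    assert 0 <= x <= MASK16 and 0 <= y <= MASK16
    return x * y

def umul32(ab, cd):
    """Unsigned 32x32 -> 64 bit multiply built from four 16x16 multiplies."""
    a, b = ab >> 16, ab & MASK16   # A = high half, B = low half of the first long
    c, d = cd >> 16, cd & MASK16   # C = high half, D = low half of the second long
    return (mul16(d, b)
            + (mul16(d, a) << 16)
            + (mul16(c, b) << 16)
            + (mul16(c, a) << 32))

def smul32(x, y):
    """Signed 32x32 -> 64 bit multiply: save signs, multiply magnitudes, fix sign."""
    negative = (x < 0) != (y < 0)
    result = umul32(abs(x), abs(y))
    return -result if negative else result
```

Note that the magnitude of the most negative 32-bit value (2^31) still fits in an unsigned 32-bit word, so the absolute-value step does not overflow the unsigned multiply.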

David Betz wrote: »

One problem with doing only 16x16 multiply is that it wouldn't really be useful for C and maybe not for Spin either. Unless you can make a 32x32 multiply out of a bunch of 16x16 multiplies.

David Betz · 2014-08-13 07:22

Bill Henning wrote: »

You can. Four 16x16 multiplies with appropriate adds is the same as a 32x32 multiply.

For unsigned 32x32: (AB = 32 bit word one, CD = 32 bit word two, A/B/C/D are the 16 bit halves of the longs)

AB
CD
(D*B) + ((D*A)<<16) + ((C*B)<<16) + ((C*A)<<32) gives 64 bit result

For signed 32x32:

- preserve the signs of AB and CD
- take the absolute value of both
- do unsigned multiplication as above
- if the sign of AB is different than sign of CD, result = 0 - result

I figured that would be the case. We might want to name this instruction "MUL16" instead of "MUL" since I think people would assume that "MUL" would mean a 32x32 multiply.

David Betz · 2014-08-13 07:24

Cluso99 wrote: »
David,
Here's my understanding...

In cog.v there are 5 states, m0-m4. The normal cycle is m0..m3, and these are the normal states we refer to in the P1...
For the current instruction
m0: Fetch instruction S contents
m1: Fetch instruction D contents
m2: Execute instruction & Fetch next instruction
m3: Writeback Result
m0: .....etc
Now,
m4: Is a wait caused by the execution of a waitx instruction (eg waitcnt, waitpeq, etc)

Don't forget these are overlapped stages for the 2 instructions that overlap that we used to refer to as
I d S D e R
        I d S D e R
Hopefully you can now follow what happens in each state as coded in the // process and following sections.

Does this help?

I just spent an hour studying the cog.v, cog_alu.v and cog_hub.v files on my flight to Philadelphia this morning and, because of your description of the instruction sequencing, I think I was able to understand most of it. Thanks! I hadn't realized that the m[] array was essentially a "one hot" vector.
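
As a rough illustration of what a "one hot" state vector means, here is a toy Python model of the m0..m4 sequencing described in the quoted post. It is not a transcription of cog.v: the transition rule for the wait state (enter m4 after m2 while a wait condition is unmet, then continue to m3) is an assumption made for this sketch, and all names are invented.

```python
# One bit per state: exactly one of the five bits is set at any time.
M0, M1, M2, M3, M4 = (1 << i for i in range(5))

def next_state(m, waiting):
    """Return the next one-hot state.

    `waiting` is True while a wait-type instruction's condition is unmet
    (assumed behaviour for this toy model).
    """
    if m == M0:          # fetch S
        return M1
    if m == M1:          # fetch D
        return M2
    if m in (M2, M4):    # execute / fetch next instruction, or already waiting
        return M4 if waiting else M3
    if m == M3:          # write back result
        return M0
    raise ValueError("state vector is not one-hot")

def run(wait_flags):
    """Step from m0 once per entry in wait_flags and return the visited states."""
    m = M0
    trace = [m]
    for waiting in wait_flags:
        m = next_state(m, waiting)
        trace.append(m)
    return trace
```

For example, `run([False] * 4)` walks m0, m1, m2, m3 and back to m0, while a sequence with the wait flag raised after m2 parks in m4 until the flag drops.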

Bill Henning · 2014-08-13 07:36

Makes sense. Given the multipliers in the FPGA, I am sure MUL16 would be fast

David Betz wrote: »

I figured that would be the case. We might want to name this instruction "MUL16" instead of "MUL" since I think people would assume that "MUL" would mean a 32x32 multiply.

Ale · 2014-08-13 08:19

We can do 32x32, but you have to discard either the low or the high 32 bits of the result. Maybe by using the C flag to select the part of the result you want, you can avoid having two opcodes for it. It would mean using an extra move to reload the argument...
mov low, x
mul low,y // next mul gets high part
mul x,y

Result in x,low

Something like that... or a special opcode to retrieve a hidden high part like in mips

pik33 · 2014-08-13 08:29

Add a temporary alu register. Let mul place lower 32 bits in destination, higher 32 bits in temporary alu register. Add an opcode which will move this temporary register to the destination. Maybe using wc, which is not needed by mul

So we will be able to do

     mul s,d1          // 32 lower bits of s*d => d1
     mul xxx,d1 wc // 32 higher bits of previous mul=>d1, xxx-don't care
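
A small Python model of this proposal, as a sketch only: the class name `Alu`, the `hidden_high` register name and the `wc` keyword argument are invented here to mirror the idea of a temporary ALU register that a second `mul ... wc` reads back.

```python
MASK32 = 0xFFFFFFFF

class Alu:
    """Toy model of a 32x32 multiply with a hidden register for the high half."""

    def __init__(self):
        self.hidden_high = 0  # temporary ALU register holding bits [63:32]

    def mul(self, d, s, wc=False):
        """Without wc: multiply, return the low 32 bits, latch the high 32 bits.
        With wc: ignore the operands and return the latched high 32 bits."""
        if wc:
            return self.hidden_high
        product = (d & MASK32) * (s & MASK32)
        self.hidden_high = product >> 32
        return product & MASK32

alu = Alu()
low = alu.mul(0xFFFFFFFF, 0xFFFFFFFF)   # first mul: low half of the product
high = alu.mul(0, 0, wc=True)           # second mul with wc: high half
```

One consequence of this design is that the two instructions must run back to back on the same cog, since any other multiply in between would overwrite the hidden register.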

Cluso99 · 2014-08-13 13:13

David Betz wrote: »

I just spent an hour studying the cog.v, cog_alu.v and cog_hub.v files on my flight to Philadelphia this morning and, because of your description of the instruction sequencing, I think I was able to understand most of it. Thanks! I hadn't realized that the m[] array was essentially a "one hot" vector.

I am pleased it helped.
I found that I could understand most of the Verilog. Unfortunately, understanding and modifying are two different things.
Some changes I can do easily with a little reference to some Verilog docs. But others are more subtle, as I am finding out now.
Nothing like learning on the fly. (and pardon the pun)
