The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

evanh · 2014-04-18 18:54

Cluso99 wrote: »

A WIN-WIN.

The inverse of what happened to the old design.

The stretched ALU time might have a pipeline-like side effect on how soon you can access the results, though. Delayed branching, anyone?

Bill Henning · 2014-04-18 19:03

How about a counter, off sysclk, with a mux per pin counter to select from:

sysclk
sysclk/2
...
sysclk/256

for the pin counters?
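
A minimal sketch of the selection described above, assuming a 4-bit select where values 0..8 pick sysclk/1 through sysclk/256. The function name, the select encoding, and the 200MHz default are illustrative assumptions, not part of any published design.

```python
# Hypothetical model of a per-pin-counter clock mux fed by one shared divider chain.
SYSCLK_HZ = 200_000_000  # assumed system clock (the 200MHz figure quoted in this thread)

def pin_counter_clock(select: int, sysclk_hz: int = SYSCLK_HZ) -> float:
    """Return the clock rate picked by a mux select of 0..8 (sysclk/1 .. sysclk/256)."""
    if not 0 <= select <= 8:
        raise ValueError("select must be in 0..8")
    # Each step of the select halves the clock: tap k is sysclk / 2**k.
    return sysclk_hz / (1 << select)

# All nine taps, fastest first.
taps = [pin_counter_clock(s) for s in range(9)]
```

With one shared divider chain, each pin counter only needs a 9-way mux rather than its own prescaler.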

cgracey wrote: »

Pin smarts may need to be less than 32 bits.

Phil Pilgrim (PhiPi) · 2014-04-18 19:06

I have to confess, I never much cared for some of the entrails going into the sausage machine; but I'm finally starting to like what's coming out!

-Phil

evanh · 2014-04-18 19:12

It's a big room with 50 whiteboards and a pile of valves in the middle.

ozpropdev · 2014-04-18 19:38

Roy Eltham wrote: »

ozpropdev,
Are you sure that's what the ROLXXXn/RORXXXn stuff does? I would have expected them to decide which subpart (nibble,byte,word) of the D register to ROL/ROR with the n part of the opcode, and how much to ROL/ROR with the S part.
So for ROLNIBn, you would have 3 bits to select which NIBBLE (the n in the opcode name), and S to select how much to ROL. So you'd have ROLNIB0 through ROLNIB7, etc.

Another way to represent them in source code would be: ROLNIB D, S/#, n

Cluso,
AUGDS D, S <- does AUGD with the D, and AUGS with the S, but all in one instruction.

Roy
In the P2 the GETNIB instruction used the format GETNIB D,S,#n where the P1+ appears to use
the format GETNIBn D,S, perhaps to make things a bit easier for PNUT.
IIRC Chip mentioned somewhere on this thread of combining GETNIB and ROLNIB.
Based on the P1+ instruction set I assume that happened.
There was no RORNIB in P2, so an ESWAP4, ROLNIB, RSWAP4 combination achieved that.

AFAIK there is no AUGDS supported in P2 as a single encoded instruction or as an alias in PNUT.
I think Cluso was suggesting it as an addition?

Cheers
Brian

ozpropdev · 2014-04-18 19:44

cgracey wrote: »

The system clock will be 200MHz. The cogs will take two clocks per instruction, so they'll run at 100 MIPS. It's looking likely that the main clock could go 250MHz. In that case, 125 MIPS x 16 cogs = 2000 MIPS total.
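
The arithmetic in the quote can be checked with a short sketch. The function names are illustrative; the only inputs are the figures quoted above (two clocks per instruction, 16 cogs).

```python
# Throughput arithmetic from the quoted figures; names are illustrative only.
def cog_mips(sysclk_mhz: float, clocks_per_instruction: int = 2) -> float:
    """MIPS for one cog: system clock in MHz divided by clocks per instruction."""
    return sysclk_mhz / clocks_per_instruction

def chip_mips(sysclk_mhz: float, cogs: int = 16) -> float:
    """Aggregate MIPS with every cog executing instructions."""
    return cog_mips(sysclk_mhz) * cogs

per_cog_200 = cog_mips(200)   # 100 MIPS per cog at 200MHz
total_250 = chip_mips(250)    # 125 MIPS x 16 cogs at 250MHz
```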

cgracey wrote: »

Right now, it's looking like the Cyclone IV FPGA boards (DE0/DE2) will be able to hit 200MHz. All the ALU stuff was proven yesterday for at least 300MHz before the last couple mux stages. I've been partitioning the ALU into 4 big chunks, each with its own input flipflops, so this should cut ALU power by 75% of what it would otherwise be. And aside from ALU in this design, there's not much circuitry in a cog. Power will go really low during WAITCNT/WAITPxx. So, I think the DE0/DE2 boards will be good for actual silicon speed. And this is from Altera Quartus' own timing analysis. I hope to have the cog design all tied together in another week.

Chip
x200-250 Wow!
Silicon speeds on FPGA, wow!
DE0 Nano will be awesome with 4 Cogs and 200MHz
Awesome is not a big enough word to describe a DE2 with 16 cogs @200 MHz!!!!
The head is spinning right now

Well done! You the man!

Cheers
Brian

Cluso99 · 2014-04-18 19:44

All, I have updated the sticky with the latest proposals.
Please keep posts there on topic (because it's a sticky and we don't want a summary polluted). Thanks.

Ariba · 2014-04-18 19:47

cgracey wrote: »
I made a better timing diagram for the new memory scheme:
5ns clock period
------------____________------------____________------------____________------------____________------------____________------------____________-

|			|			|			|			|			|			|
|-------+		|	       rdRAM Ic |-------+		|	       rdRAM Id |-------+		|	       rdRAM Ie |
|	|		|			|	|		|			|	|		|			|
|---+	+----> rdRAM Db |------------> latch Db |---+	+----> rdRAM Dc |------------> latch Dc |---+	+----> rdRAM Dd |------------> latch Dd |
|---+	+----> rdRAM Sb |------------> latch Sb |---+	+----> rdRAM Sc |------------> latch Sc |---+	+----> rdRAM Sd |------------> latch Sd |
|---+	+----> latch Ib	|------------> latch Ib	|---+	+----> latch Ic	|------------> latch Ic	|---+	+----> latch Id	|------------> latch Id	|
|   |			|			|   |			|			|   |			|			|
|   +------------------ALU-----------> wrRAM Ra	|   +------------------ALU-----------> wrRAM Rb	|   +------------------ALU-----------> wrRAM Rc	|
|			|			|			|			|			|			|
|			|	<wait a>	|			|	<wait b>	|			|	<wait c>	|
|			|			|			|			|			|			|
Cog RAM register indirection will be achieved by substituting D/S fields within the instruction being read from the RAM, before it feeds the rdRAM D/S inputs and before it's latched. There is time to do this, but there is not enough time to look at the instruction data coming out of the RAM (which arrives near the end of the cycle), make some decision based on it, and then buffer up to drive some mux's to do D/S field substitutions in time to feed the rdRAM D/S inputs. The latter would be required to achieve INDx like we had on Prop2. In the new scheme, substitute D/S values are ready to go and only need to be muxed into the instruction data coming out of the cog RAM.

Note that on that diagram, at the beginning of every clock, two memory accesses are initiated with the dual-port cog RAM. On one cycle you initiate both the instruction read and the result write. On the other cycle, you issue reads for both D and S registers. Also note that there are two full clock periods for the ALU result to settle in, while the ALU inputs are held steady from flipflops. The slowest ALU path is through the rotator (ROR/ROL/RCR/RCL/SHR/SHL/SAR/REV). Quartus is showing that circuit's Fmax at 150MHz, which would mean a 300MHz system clock. Once the last mux's are tied in and data-forwarding is implemented, I think 200MHz shouldn't be a problem.

Looking at the timing diagram, I think for self-modifying code you now need 2 spacer instructions before you can read/execute the modified register. Is this the same with hubreads - how long do you need to wait after a RDLONG or a RDQUAD until you can read/execute the loaded data?
I know with ALTDS we can modify the S and D field much faster, so we don't need self-modify for indirect addressing now.

Andy

jmg · 2014-04-18 19:55

cgracey wrote: »

Pin smarts may need to be less than 32 bits.

Nooooooooo ......

Does that mean the NCO has moved to the pins ?

What MHz can the pins run to, for 32 bits ?

With a lot of other parts offering 32 bit counters, pulling back on that is not going to look good, as well as giving one point where the P1+ is less than a P1

I did some Verilog tests on PinCells a few days back in Lattice parts, and got 205~240MHz reported speeds after P&R
Bobbled about a little - similar process node I think to the Cyclone IV

Seairth · 2014-04-18 20:02

cgracey wrote: »

I made a better timing diagram for the new memory scheme:

5ns clock period
------------____________------------____________------------____________------------____________------------____________------------____________-

|			|			|			|			|			|			|
|-------+		|	       rdRAM Ic |-------+		|	       rdRAM Id |-------+		|	       rdRAM Ie |
|	|		|			|	|		|			|	|		|			|
|---+	+----> rdRAM Db |------------> latch Db |---+	+----> rdRAM Dc |------------> latch Dc |---+	+----> rdRAM Dd |------------> latch Dd |
|---+	+----> rdRAM Sb |------------> latch Sb |---+	+----> rdRAM Sc |------------> latch Sc |---+	+----> rdRAM Sd |------------> latch Sd |
|---+	+----> latch Ib	|------------> latch Ib	|---+	+----> latch Ic	|------------> latch Ic	|---+	+----> latch Id	|------------> latch Id	|
|   |			|			|   |			|			|   |			|			|
|   +------------------ALU-----------> wrRAM Ra	|   +------------------ALU-----------> wrRAM Rb	|   +------------------ALU-----------> wrRAM Rc	|
|			|			|			|			|			|			|
|			|	<wait a>	|			|	<wait b>	|			|	<wait c>	|
|			|			|			|			|			|			|

Wait. Does this mean that instructions are actually taking 4 clock cycles, but due to semi-pipelining are effectively 2 clock cycles? And if this is the case, do branches take 4 cycles, with the following instruction being cancelled?

cgracey · 2014-04-18 20:12

Ariba wrote: »

Looking at the timing diagram, I think for self-modifying code you now need 2 spacer instructions before you can read/execute the modified register. Is this the same with hubreads - how long do you need to wait after a RDLONG or a RDQUAD until you can read/execute the loaded data?
I know with ALTDS we can modify the S and D field much faster, so we don't need self-modify for indirect addressing now.

Andy

Because of the data-forwarding circuitry, which passes the ALU result to the next I/D/S if the addresses match, you will only need one spacer instruction before executing a modified register. For RDQUAD, you will need two spacers before execution, unless I make the data-forwarding circuit handle 4x the data, which is overkill. For RDQUAD, you will need one spacer before you can use D/S to access the new longs, unless, again, I make the data-forwarding circuit handle 4x the data. Wait! If we just log the first address and the first long of the RDQUAD into the data-forwarding circuit, we'll be able to pass results off right away if the next I/D/S wanted that first register of the recent RDQUAD. So, for cases in which I/D/S want to access the first long of a RDQUAD right away, only one spacer instruction will be needed before execution, and none for D/S access - this is just like any other instruction. That's how we'll do it. In most every case, it will be the first long of the RDQUAD that would be executed/accessed, anyway.
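
A rough behavioral sketch of the forwarding idea described above: the most recent result is held with its address, and a read whose address matches gets that value instead of the stale RAM contents. The class, its method names, and the 512-long RAM size are assumptions for illustration; this is not the actual circuit.

```python
# Hypothetical software model of result forwarding in front of a cog RAM.
class CogRamWithForwarding:
    def __init__(self, size: int = 512):  # 512 longs is an assumption
        self.ram = [0] * size
        self.pending = None  # (address, value) of a result not yet written to RAM

    def issue_result(self, addr: int, value: int) -> None:
        """Log a new ALU result; any older pending result is written back first."""
        self.commit()
        self.pending = (addr, value & 0xFFFFFFFF)

    def commit(self) -> None:
        """Write the pending result into RAM, as the later wrRAM step would."""
        if self.pending is not None:
            addr, value = self.pending
            self.ram[addr] = value
            self.pending = None

    def read(self, addr: int) -> int:
        """Return the forwarded value on an address match, else the RAM contents."""
        if self.pending is not None and self.pending[0] == addr:
            return self.pending[1]
        return self.ram[addr]
```

Logging only the first address and first long of a RDQUAD, as proposed, would correspond to calling `issue_result` once for that first long while the other three longs go straight to RAM.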

cgracey · 2014-04-18 20:13

jmg wrote: »

Nooooooooo ......

Does that mean the NCO has moved to the pins ?

What MHz can the pins run to, for 32 bits ?

With a lot of other parts offering 32 bit counters, pulling back on that is not going to look good, as well as giving one point where the P1+ is less than a P1

I did some Verilog tests on PinCells a few days back in Lattice parts, and got 205~240MHz reported speeds after P&R
Bobbled about a little - similar process node I think to the Cyclone IV

We'll have to see, yet. The cogs each have two CTRs which are 32-bit.

cgracey · 2014-04-18 20:15

Seairth wrote: »

Wait. Does this mean that instructions are actually taking 4 clock cycles, but due to semi-pipelining are effectively 2 clock cycles? And if this is the case, do branches take 4 cycles, with the following instruction being cancelled?

That's exactly right.
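
To put numbers on that, here is a small sketch of the cycle accounting implied by the exchange above: 4 clocks of latency per instruction, a new instruction issued every 2 clocks, and a taken branch costing 4 clocks because the following instruction is cancelled. The function names and the exact accounting are assumptions drawn only from those statements.

```python
# Hypothetical cycle accounting for the overlapped two-stage scheme.
LATENCY_CLOCKS = 4  # clocks from start to finish of one instruction
ISSUE_CLOCKS = 2    # clocks between successive instruction starts

def straight_line_clocks(n_instructions: int) -> int:
    """Clocks until the last of n back-to-back instructions has finished."""
    if n_instructions <= 0:
        return 0
    return LATENCY_CLOCKS + ISSUE_CLOCKS * (n_instructions - 1)

def effective_clocks(n_instructions: int, taken_branches: int) -> int:
    """Steady-state cost: 2 clocks per instruction, 4 per taken branch."""
    if not 0 <= taken_branches <= n_instructions:
        raise ValueError("taken_branches must be between 0 and n_instructions")
    plain = n_instructions - taken_branches
    return ISSUE_CLOCKS * plain + LATENCY_CLOCKS * taken_branches
```

For example, a 10-instruction loop body ending in one taken branch would cost 22 clocks per pass under this model.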

jmg · 2014-04-18 20:22

cgracey wrote: »

We'll have to see, yet. The cogs each have two CTRs which are 32-bit.

So the COGs have "P1 Counters", at 200MHz+ ? (minus the PLL option?)

The Pin cells should be able to operate at the same speed as the COG Counter-Adder ?

cgracey · 2014-04-18 20:44

jmg wrote: »

So the COGs have "P1 Counters", at 200MHz+ ? (minus the PLL option?)

The Pin cells should be able to operate at the same speed as the COG Counter-Adder ?

Oh, you're right! I'll need to run those CTRs through Quartus and see if they're up to running at 200MHz.

Ariba · 2014-04-18 21:02

cgracey wrote: »

Because of the data-forwarding circuitry, which passes the ALU result to the next I/D/S if the addresses match, you will only need one spacer instruction before executing a modified register. For RDQUAD, you will need two spacers before execution, unless I make the data-forwarding circuit handle 4x the data, which is overkill. For RDQUAD, you will need one spacer before you can use D/S to access the new longs, unless, again, I make the data-forwarding circuit handle 4x the data. Wait! If we just log the first address and the first long of the RDQUAD into the data-forwarding circuit, we'll be able to pass results off right away if the next I/D/S wanted that first register of the recent RDQUAD. So, for cases in which I/D/S want to access the first long of a RDQUAD right away, only one spacer instruction will be needed before execution, and none for D/S access - this is just like any other instruction. That's how we'll do it. In most every case, it will be the first long of the RDQUAD that would be executed/accessed, anyway.

Thank you Chip, sounds really good

Cluso99 · 2014-04-18 21:09

Chip,

We can effectively modify the D & S of an instruction (previously using MOVD & MOVS, now SETD & SETS) by ALTDS. ALTDS is necessary for hubexec self-modifying code.

Might an ALTIX instruction overcome modifying the instruction opcode (opcode/zc/i/cccc) for hubexec mode?
You could use the top 4 bits of D to indicate if opcode / zc / i / cccc bits were to be modified, and the remaining 5x D bits plus 9x S bits to be the opcode/zc/i/cccc bits.
Honestly, I don't know how useful this might be. I just thought I would mention it as I know you have used instruction modification in the Spin Interpreter.

kwinn · 2014-04-18 21:14

I'm not sure it was done for the 6809 either, although I think you are right. IIRC there were two ways to do this at the time. A very very expensive logic analyzer could read the address and data bus and count each instruction the micro executed. It could also count how many times a data location was accessed.

For those of us on a smaller budget it could be done by modifying the program to add counters to track how many times each subroutine or loop of the program was executed. In essence a table of frequencies for the instructions in every loop and subroutine of the program was created, the instruction frequencies were multiplied by the number of times the loop or subroutine was executed, and the result for each instruction added for a total. That was not an easy task.
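
The bookkeeping described above can be sketched as follows: a static instruction count for each loop or subroutine, multiplied by how many times that block ran, then summed per instruction. The mnemonics and counts in the example data are made up for illustration.

```python
from collections import Counter

def dynamic_instruction_counts(blocks):
    """blocks: iterable of (static_counts, executions) pairs, one per loop or subroutine.

    static_counts maps mnemonic -> occurrences in that block's source;
    executions is how many times the block ran (from the added counters).
    """
    total = Counter()
    for static_counts, executions in blocks:
        for mnemonic, count in static_counts.items():
            total[mnemonic] += count * executions
    return total

# Hypothetical data: one hot inner loop and one-time setup code.
blocks = [
    ({"MOV": 2, "ADD": 1, "DJNZ": 1}, 1000),
    ({"MOV": 3, "COGINIT": 1}, 1),
]
totals = dynamic_instruction_counts(blocks)
```

In this made-up example COGINIT runs once while the loop instructions each run a thousand times or more, even though each appears only a handful of times in the source.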

Heater. wrote: »

It's an odd thing this instruction counting business. I remember Motorola explaining how they analysed a lot of existing software for instruction frequency before coming up with the 6809 (I think) instruction set. I always wondered how they actually did that.

One can't simply assume that because an instruction occurs infrequently in people's source code that we could perhaps not bother implementing it in the CPU.

One could imagine a "rare" instruction that only occurs once in any program but is actually essential, like say a COGINIT or some such.

Or a performance-enhancing instruction, MUL16 say, that only ever occurs once in a program but is key to getting adequate performance and is actually called a lot at run time.

What might be interesting to know is how many times each instruction is called in real running programs. Is it in the critical tight loop that makes the whole program possible?

This is a bit hard to determine from the source code without actually reading it all and finding the time critical parts.

kwinn · 2014-04-18 21:21

BTW, I think the propeller would be very good for this type of application. The emulators and LMM could be modified to count the number of times an instruction is executed for software that is not limited too much by RT timing constraints.

Heater. · 2014-04-19 01:31

Chip,

So ROLNIB should actually be SHLNIB.

But that's ugly; I'd go for:

SHL4
SHL8

etc.

200MHz on the FPGA is incredible.

Phil,

Looks like there are no entrails going into this sausage, it's all pure muscle.

Baggers · 2014-04-19 01:38

Chip, that is awesome news

16 cogs at 100 MIPS on DE2 is gonna blow minds! And if they can go 125 MIPS that would just be out of this world!

All this at low power too! You sir are a genius!

Cheers,
Jim.

cgracey · 2014-04-19 01:43

Heater. wrote: »

Chip,

So ROLNIB should actually be SHLNIB.

But that's ugly; I'd go for:

SHL4
SHL8

But it's a little different than that, more in the spirit of RCL than SHL.

ROLNIB5 D,S - Get nibble #5 from S and rotate it left into D (by 4 bits)

ROLNIB0..7 and RORNIB0..7 exist, as well as ROLBYT0..3, RORBYT0..3, ROLWRD0..1, and RORWRD0..1.
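
One possible reading of the ROLxxx description, sketched as a model: D is shifted left by the field width and the selected field of S fills the vacated low bits. That the old top bits of D are discarded is an assumption, and the helper names are illustrative.

```python
MASK32 = 0xFFFFFFFF

def rol_field(d: int, s: int, n: int, width: int) -> int:
    """Shift D left by `width` bits and insert field #n of S into the low bits."""
    if width not in (4, 8, 16):
        raise ValueError("width must be 4 (nibble), 8 (byte) or 16 (word)")
    if not 0 <= n < 32 // width:
        raise ValueError("field index out of range for this width")
    field = (s >> (width * n)) & ((1 << width) - 1)
    return ((d << width) | field) & MASK32

def rolnib(d: int, s: int, n: int) -> int:
    """Model of ROLNIBn D,S for n in 0..7."""
    return rol_field(d, s, n, 4)

def rolbyt(d: int, s: int, n: int) -> int:
    """Model of ROLBYTn D,S for n in 0..3."""
    return rol_field(d, s, n, 8)
```

Under this reading, repeated ROLNIB calls build up a long one nibble at a time, with the oldest nibble ending up in the most significant position.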

evanh · 2014-04-19 02:03

Has to roll back into the same register, forming a circular loop, to be a rotation doesn't it? In this case D would also have to feed into S. Otherwise it's just a shift.

evanh · 2014-04-19 02:44

Chip, with the stretched ALU timing, does this give leeway for the hardware stack being converted to using CogRAM in place of the 4 deep LIFO? I envisage it still retaining the dedicated stack pointer, but addressing CogRAM instead of LIFO.

jmg · 2014-04-19 03:37

evanh wrote: »

Chip, with the stretched ALU timing, does this give leeway for the hardware stack being converted to using CogRAM in place of the 4 deep LIFO? I envisage it still retaining the dedicated stack pointer, but addressing CogRAM instead of LIFO.

ALU timing would be separate, and same-memory stacks usually need more cycles, but I think Chip has said Branches are now 4 cycles, so there may be cycle-space in that, to use CogRAM as stack ?

Cluso99 · 2014-04-19 22:17

I have been looking at the new proposed instruction set.

Some of these instructions seem to be a carry over from the P2 and do not seem to be so relevant now that we will have hubexec. Some of them cater more for cog execution where we did tricks because of the cog space restriction. Some just seem to be there because they are reverse equivalents.

You may wonder why I am asking this. Well, everything uses silicon. If some of these instructions are unnecessary, maybe we can save enough silicon to add something more beneficial, such as a deeper LIFO.

Are these instructions now really necessary ???
SETCOND, GETCOND, GETI, GETD, GETS

Are these necessary (maybe they are for video modes???) ???
RORNIBn, RORBYTn, RORWRDn, ROLNIBn, ROLBYTn, ROLWRDn

I presume these are necessary, but worthy for comment...
INCMOD, DECMOD, ESWAP4, ESWAP8, SPLITW, MERGEW, TOPBIT, DECOD, PICKZC

PICKZC - helps restore Z & C flags quickly from any bit pair. Extremely useful IMHO. Also permits 4 state decoding from 2 bits simply.
ESWAP8 - useful for reordering bytes in a long (big vs little endian)

Do we need this many compare variations (note we no longer have NR option on ADDnn/SUBnn) ???
TESTB, TESTN, TEST, CMP, CMPX, CMPS, CMPSX, CMPR, CMPSUB

TEST, TESTN - both extremely useful.
CMP - Obviously required IMHO.
CMPR - I have found this extremely useful.

A TARG instruction may avoid some of these compare variations?
TARG was an instruction Chip used in the P2 to redirect the result of the following instruction to another location (does not overwrite D). By selecting a read-only register, NR could be simulated (using the additional instruction). But the real benefit of TARG was the ability of directing the result to a different register, thereby making a simple C = D xxx S (does not overwrite D).

ozpropdev · 2014-04-20 00:25

Cluso99 wrote: »

I have been looking at the new proposed instruction set.

Some of these instructions seem to be a carry over from the P2 and do not seem to be so relevant now that we will have hubexec. Some of them cater more for cog execution where we did tricks because of the cog space restriction. Some just seem to be there because they are reverse equivalents.

You may wonder why I am asking this. Well, everything uses silicon. If some of these instructions are unnecessary, maybe we can save enough silicon to add something more beneficial, such as a deeper LIFO.

Are these instructions now really necessary ???
SETCOND, GETCOND, GETI, GETD, GETS

Are these necessary (maybe they are for video modes???) ???
RORNIBn, RORBYTn, RORWRDn, ROLNIBn, ROLBYTn, ROLWRDn

I presume these are necessary, but worthy for comment...
INCMOD, DECMOD, ESWAP4, ESWAP8, SPLITW, MERGEW, TOPBIT, DECOD, PICKZC

PICKZC - helps restore Z & C flags quickly from any bit pair. Extremely useful IMHO. Also permits 4 state decoding from 2 bits simply.
ESWAP8 - useful for reordering bytes in a long (big vs little endian)

Do we need this many compare variations (note we no longer have NR option on ADDnn/SUBnn) ???
TESTB, TESTN, TEST, CMP, CMPX, CMPS, CMPSX, CMPR, CMPSUB

TEST, TESTN - both extremely useful.
CMP - Obviously required IMHO.
CMPR - I have found this extremely useful.

A TARG instruction may avoid some of these compare variations?
TARG was an instruction Chip used in the P2 to redirect the result of the following instruction to another location (does not overwrite D). By selecting a read-only register, NR could be simulated (using the additional instruction). But the real benefit of TARG was the ability of directing the result to a different register, thereby making a simple C = D xxx S (does not overwrite D).

A few quick thoughts on the above instructions

ROLNIB would be useful for faster data builds from Quad-SPI reads. Depending on the video block, it may have use there too.
ESWAP4 used with ESWAP8 can swap nibbles in bytes. (Very low silicon impact, just signal routing? Basic Verilog code. Chip? jmg?)
TOPBIT is useful for magnitude assessment. (used to be ENCOD in P2 ?)
DECOD is useful for indexed bit masking. (Used to be DECOD5 in P2 ?)
SPLITW is useful for separating data for IO pin spacing (dual row header fan out) (Again, just signal routing?)
MERGEW opposite of above and can be used in memory address mapping (Roy Eltham had a post about that somewhere)
TESTB is very useful for testing flags without needing masks. Uses WZ and WC flags
TARG is possibly now going to be a function of ALTDS. Mode bits [8:6].
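
For readers unfamiliar with a few of these, here is a sketch of what the one-line descriptions of TOPBIT, DECOD and ESWAP8 suggest. These are models of the descriptions only; details such as the result for a zero input and the 5-bit index width are assumptions.

```python
MASK32 = 0xFFFFFFFF

def topbit(value: int) -> int:
    """Position of the highest set bit (magnitude assessment).

    Returns -1 for zero as a modelling choice; the hardware result is not stated.
    """
    return (value & MASK32).bit_length() - 1

def decod(index: int) -> int:
    """Single-bit mask selected by a 5-bit index (indexed bit masking)."""
    return 1 << (index & 31)

def eswap8(value: int) -> int:
    """Reverse the byte order of a long (big vs little endian)."""
    return int.from_bytes((value & MASK32).to_bytes(4, "little"), "big")
```

Each of these is a fixed rearrangement or a small priority encoder, which fits the "just signal routing?" comments above.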

Cheers
Brian

Roy Eltham · 2014-04-20 01:56

Cluso,
All of those are useful. Just because you don't find them useful or interesting, doesn't mean they should go away. They are not enough silicon to make much difference.
Please stop trying to get rid of stuff, thanks.

Cluso99 · 2014-04-20 02:02

Roy Eltham wrote: »

Cluso,
All of those are useful. Just because you don't find them useful or interesting, doesn't mean they should go away. They are not enough silicon to make much difference.
Please stop trying to get rid of stuff, thanks.

Roy,
I think you should read my post properly.
Do you find all of them useful??? This is what I asked.
There are at least some that I no longer think (within the recent discussions) have large usage.
Wouldn't you find a larger LIFO more beneficial???

RossH · 2014-04-20 02:06

Roy Eltham wrote: »

They are not enough silicon to make much difference.

After the recent experience with the original P2, I think anything that can be left out, should be left out. No individual feature or instruction will take us over the line and over the 2W boundary (which seems to be commonly accepted as being "reasonable" for a chip like the P16X32B) - but all together they probably will.

And don't forget all the additional complexity has to be exhaustively tested - so every instruction adds an increment to the delay before this chip can get to market.

Ross.
