The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

Brian Fairchild · 2014-04-15 00:26

How many opcodes are we up to now?

cgracey · 2014-04-15 01:21

Well, sorry to say this, but I was right in my first explanation of why INDA/INDB won't (easily) work.

Here is a clock cycle diagram. You can see that the state alternates on each cycle. If instruction 'a' must WAIT, that happens by deselecting an ENA that staves off all the activity in the <wait a> state:

------------____________------------____________------------____________------------____________----

|		read Db |	    register Db |		read Dc |	    register Dc |
|		read Sb |	    register Sb |		read Sc |	    register Sc |
|			|	 re-register Ib	|			|	 re-register Ic	|
|			|			|			|			|
|	    register Ib |		read Ic |	    register Ic |		read Id |
|			|	       write Ra |			|	       write Rb |
|			|			|			|			|
|			|	<wait a>	|			|	<wait b>	|

|----------------------ALU----------------------|----------------------ALU----------------------|

When 'register Ib' occurs, 'read Db' and 'read Sb' are issued from those same bits (D and S fields of Ib). The read D and S bits arrive just before 'register Db/Sb' - the same point at which 'read Ic' and 'write Ra' are being issued. So, not only is there no time to figure out if Db/Sb are indirect, but the write for the last instruction is being issued, along with the read for the next instruction. This clock, however, is one that can be delayed by an ENA, in order to wait for instruction 'a' to finish. If we wanted to do indirect, we would need to issue fresh S and D reads on the next clock, and then wait an extra clock in order to give the ALU the two clocks that it needs to settle the result before 'write Rb' is issued. This means indirects would take 4 clocks:

------------____________------------____________------------____________------------____________------------____________------------____________----

|		read Db |	    register Db |	       read Db'	|	   register Db'	|		read Dc |	    register Dc |
|		read Sb |	    register Sb |	       read Sb'	|	   register Sb'	|		read Sc |	    register Sc |
|			|	 re-register Ib	|			|			|			|	 re-register Ic	|
|			|			|			|			|			|			|
|	    register Ib |		read Ic |	    register Ic |			|			|		read Id |
|			|	       write Ra |			|			|			|	       write Rb |
|			|			|			|			|			|			|
|			|	<wait a>	|	<indirect>	|	<indirect>	|			|	<wait b>	|

|----------------------ALU----------------------|						|----------------------ALU----------------------|

This is so slow and needs so much hardware that it makes me think that the ALTDS instruction discussed earlier would be the optimal fit.
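
As a rough way to compare the two options, the clock counts in the diagrams can be tallied in a small model. This is only a sketch: the 2-clock and 4-clock figures come from the diagrams above, while the function names and the assumption that an ALTDS-style prefix costs one ordinary 2-clock instruction are illustrative, not a description of the actual design.

```python
# Illustrative cost model only; names are hypothetical, not part of any tool.
REGULAR_CLOCKS = 2    # normal instruction, per the first diagram
INDIRECT_CLOCKS = 4   # instruction with built-in indirection, per the second diagram


def cost_builtin_indirect(n_indirect: int) -> tuple[int, int]:
    """Return (clocks, longs) for n instructions using built-in INDA/INDB."""
    return n_indirect * INDIRECT_CLOCKS, n_indirect


def cost_altds_prefix(n_indirect: int) -> tuple[int, int]:
    """Return (clocks, longs) when each indirect access is prefix + instruction.

    Assumes the ALTDS-style prefix is an ordinary 2-clock instruction.
    """
    return n_indirect * 2 * REGULAR_CLOCKS, n_indirect * 2


if __name__ == "__main__":
    for n in (1, 20):
        print(n, cost_builtin_indirect(n), cost_altds_prefix(n))
```

Under these assumptions the two approaches cost the same number of clocks and differ only in code size, with the prefix version using twice as many longs per indirect access.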

Cluso99 · 2014-04-15 01:58

cgracey wrote: »
Well, sorry to say this, but I was right in my first explanation of why INDA/INDB won't (easily) work.

Here is a clock cycle diagram. You can see that the state alternates on each cycle. If instruction 'a' must WAIT, that happens by deselecting an ENA that staves off all the activity in the <wait a> state:
------------____________------------____________------------____________------------____________----

|               read Db |           register Db |               read Dc |           register Dc |
|               read Sb |           register Sb |               read Sc |           register Sc |
|                       |        re-register Ib |                       |        re-register Ic |
|                       |                       |                       |                       |
|           register Ib |               read Ic |           register Ic |               read Id |
|                       |              write Ra |                       |              write Rb |
|                       |                       |                       |                       |
|                       |       <wait a>        |                       |       <wait b>        |

|----------------------ALU----------------------|----------------------ALU----------------------|
When 'register Ib' occurs, 'read Db' and 'read Sb' are issued from those same bits (D and S fields of Ib). The read D and S bits arrive just before 'register Db/Sb' - the same point at which 'read Ic' and 'write Ra' are being issued. So, not only is there no time to figure out if Db/Sb are indirect, but the write for the last instruction is being issued, along with the read for the next instruction. This clock, however, is one that can be delayed by an ENA, in order to wait for instruction 'a' to finish. If we wanted to do indirect, we would need to issue fresh S and D reads on the next clock, and then wait an extra clock in order to give the ALU the two clocks that it needs to settle the result before 'write Rb' is issued. This means indirects would take 4 clocks:
------------____________------------____________------------____________------------____________------------____________------------____________----

|               read Db |           register Db |              read Db' |          register Db' |               read Dc |           register Dc |
|               read Sb |           register Sb |              read Sb' |          register Sb' |               read Sc |           register Sc |
|                       |        re-register Ib |                       |                       |                       |        re-register Ic |
|                       |                       |                       |                       |                       |                       |
|           register Ib |               read Ic |           register Ic |                       |                       |               read Id |
|                       |              write Ra |                       |                       |                       |              write Rb |
|                       |                       |                       |                       |                       |                       |
|                       |       <wait a>        |       <indirect>      |       <indirect>      |                       |       <wait b>        |

|----------------------ALU----------------------|                                               |----------------------ALU----------------------|
This is so slow and needs so much hardware that it makes me think that the ALTDS instruction discussed earlier would be the optimal fit.

At 4 clocks, it would be no slower than having used 2 instructions (ALTDS + instruction). The hardware (silicon/power) is something I think only you can decide. Obviously we would like the hw to do it all for us because it saves the instruction and hence memory. But we also have to be realistic!

BTW Thanks for the timing diagram and info.

Roy Eltham · 2014-04-15 02:50

Cluso,
I think he means 4 extra clocks, so they would be 6 clocks total (looking at his diagram it shows what looks like 6 clock areas for the indirect version that works). So the two instruction version (with ALTDS) would be 4 clocks total, but requires 2x memory.

I think it's an acceptable situation. We had worse requirements before in P1, needing two instructions and a gap, and that's without any inc/dec. So the two instruction version that gives us both D and S and inc/dec is a big reduction from the same support on P1 (which in worst case needs 5 instructions along with possible extra gaps required). I think we can live quite happily with this version, especially if it keeps the core simple and fast.

We do already have a fair number of new instructions that reduce the size of code compared to equivalent P1 versions. So I think we'll find we are able to do more than we used to in the core local memory.

I dunno about you, but I am okay with not getting some conveniences if it means staying at 200MHz or even above that!

potatohead · 2014-04-15 02:53

I really like the sound of optimal at this point. Adding an instruction, keeping it simple, with a reasonable net time consumption is the way to go. We keep design complexity on critical paths simple, potential clock speeds high, power budget appropriate, and it works, and we can work on presenting it to users in ways that make sense too.

Keeping the clock high is really important for a lot of things. Seconded.

More importantly, this one can be marked "done", and that's something to be aware of too.

jmg · 2014-04-15 02:57

Cluso99 wrote: »

At 4 clocks, it would be no slower than having used 2 instructions (ALTDS + instruction). The hardware (silicon/power) is something I think only you can decide. Obviously we would like the hw to do it all for us because it saves the instruction and hence memory.

The ALTDS instruction is presented to the user as one opcode by the Assembler, and does in-line self modify, so the minus is less than ideal code size, but this is not going to be used in a lot of places (i.e. not like a jump), and the Auto INC variants 'buy back' that memory?

An ALTDS instruction could be done now, to get some functional numbers, and revisited later if the die looks empty (unlikely?)

Cluso99 · 2014-04-15 03:02

Roy Eltham wrote: »

Cluso,
I think he means 4 extra clocks, so they would be 6 clocks total (looking at his diagram it shows what looks like 6 clock areas for the indirect version that works). So the two instruction version (with ALTDS) would be 4 clocks total, but requires 2x memory.

I think it's an acceptable situation. We had worse requirements before in P1, needing two instructions and a gap, and that's without any inc/dec. So the two instruction version that gives us both D and S and inc/dec is a big reduction from the same support on P1 (which in worst case needs 5 instructions along with possible extra gaps required). I think we can live quite happily with this version, especially if it keeps the core simple and fast.

We do already have a fair number of new instructions that reduce the size of code compared to equivalent P1 versions. So I think we'll find we are able to do more than we used to in the core local memory.

I dunno about you, but I am okay with not getting some conveniences if it means staying at 200MHz or even above that!

Thanks Roy. I hadn't looked closely enough to realise it was 6 clocks.
Yes, definitely want to stay 200MHz or above if possible. The main problem is with hubexec anyway, so I would rather live with restrictions than lose it.
Maybe there is something else we can do for the CALL/RET and PUSH/POP case for in-cog stack(s).
I will anxiously wait to hear from Chip.

Baggers · 2014-04-15 07:43

Bill Henning wrote: »

1: yes

2: no

Thanks Bill

tonyp12 · 2014-04-15 08:36

>GETNIB/BYTE/WORD should perform a ROL function, too
Do like ARM, most instructions have the optional Shift, so mov and shift/rol are just the same but the mov is with a 0 shift
Though it seems to be available only on the Source, and you cannot shift the Destination.
Though LDR R12,R12 #shr2 probably could do it but would waste a line of code.

P3 should be 64bit:
10bit opcode, 3x 11bit regs + 2bit sh/rol + 16bit immediate + 3bit set z/c/n flags
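
The field widths listed above can be checked by adding them up and packing them into one word. This is just a sketch of the arithmetic: the field names, their order, and the choice to put the opcode in the top bits are assumptions made for illustration, not a proposed encoding.

```python
# Hypothetical 64-bit instruction layout built from the widths listed above.
# Field names and ordering are assumptions made for illustration.
FIELDS = [
    ("opcode", 10),
    ("reg_a", 11),
    ("reg_b", 11),
    ("reg_c", 11),
    ("shift", 2),
    ("imm", 16),
    ("flags", 3),
]

TOTAL_BITS = sum(width for _, width in FIELDS)


def pack(values: dict[str, int]) -> int:
    """Pack field values into one integer, first field in the top bits."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word: int) -> dict[str, int]:
    """Inverse of pack(): split a word back into named fields."""
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values


if __name__ == "__main__":
    print(TOTAL_BITS)
```

The widths do sum to 64, so the layout fills the word with no spare bits.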

cgracey · 2014-04-15 09:04

These indirect instructions would take 4 clocks, not 6.

I'm still evaluating what to do here.

Seairth · 2014-04-15 09:13

cgracey wrote: »

These indirect instructions would take 4 clocks, not 6.

I'm still evaluating what to do here.

I missed something. What is wrong with the ALTDS approach, other than the extra instruction?

cgracey · 2014-04-15 09:44

Seairth wrote: »

I missed something. What is wrong with the ALTDS approach, other than the extra instruction?

INDx would take less minding and be easier to understand.

I looked at my most complex program, the ROM_Monitor, and it uses only FOUR indirect accesses.

The ROM_SHA256 program uses TWENTY indirect accesses. Half of these cases are preceded by a SETINDx instruction, which would become an ALTDS, creating maybe a net TEN-instruction gain.

One good thing about ALTDS is that it is just a regular two-clock instruction, giving other tasks fewer hiccups than four-clock INDx instructions would.
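
The ROM_SHA256 estimate can be worked through explicitly. This is one reading of the figures in the post: it assumes every indirect access needs its own ALTDS, and that each existing SETINDx simply turns into the ALTDS for the access it precedes. The function name is made up for the example.

```python
def net_extra_instructions(indirect_accesses: int, preceded_by_setind: int) -> int:
    """Extra longs needed if every indirect access requires an ALTDS prefix.

    Assumes each existing SETINDx becomes the ALTDS for the access it
    precedes, so only the remaining accesses need a new instruction.
    """
    if not 0 <= preceded_by_setind <= indirect_accesses:
        raise ValueError("preceded_by_setind must be between 0 and indirect_accesses")
    return indirect_accesses - preceded_by_setind


if __name__ == "__main__":
    # ROM_SHA256 figures from the post: 20 accesses, half preceded by SETINDx.
    print(net_extra_instructions(20, 10))
```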

Bill Henning · 2014-04-15 09:56

I think the separate ALTDS instruction makes sense here.

Can you still keep the INDS based call/ret to two cycles, as one instruction? That could be a simple --INDx for saving the PC, and INDx++ for restoring it. (this way the stack can grow down from the top of the cog)

Or, if it is easier, just have a stack pointer register, and have the CALL/RET use that. SETSP/GETSP and then it does not even have to be exposed as a register, and I suspect could be kept to 2 cycles.

The reason for the above suggestion... CALL/RET is used too often to use a two instruction sequence, as it would waste too much memory.

cgracey wrote: »

INDx would take less minding and be easier to understand.

I looked at my most complex program, the ROM_Monitor, and it uses only FOUR indirect accesses.

The ROM_SHA256 program uses TWENTY indirect accesses. Half of these cases are preceded by a SETINDx instruction, which would become an ALTDS, creating maybe a net TEN-instruction gain.

One good thing about ALTDS is that it is just a regular two-clock instruction, giving other tasks fewer hiccups than four-clock INDx instructions would.

cgracey · 2014-04-15 10:13

Bill Henning wrote: »

I think the separate ALTDS instruction makes sense here.

Can you still keep the INDS based call/ret to two cycles, as one instruction? That could be a simple --INDx for saving the PC, and INDx++ for restoring it. (this way the stack can grow down from the top of the cog)

Or, if it is easier, just have a stack pointer register, and have the CALL/RET use that. SETSP/GETSP and then it does not even have to be exposed as a register, and I suspect could be kept to 2 cycles.

The reason for the above suggestion... CALL/RET is used too often to use a two instruction sequence, as it would waste too much memory.

Any extra access into cog RAM is going to take another two clocks. The only way around this is to place logic after the RAM data outputs. That would lengthen the clock period, but give us easy INDx and CALL/RET via cog RAM.

Bill Henning · 2014-04-15 11:16

I understand.

The problem is that CALL/RET are very common, so using two longs will waste a lot of memory, unlike the general ALTDS case, which will happen a lot less frequently.

Would having a dedicated SP, composed of flip-flops, not help?

CALL addr

1: decrement SP
2: latch addr into regs[sp], pc = addr

RET addr

1: pc = regs[sp]
2: increment sp

Just trying to learn here
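
The two-step CALL/RET sequence above can be modelled in software to see how the stack pointer and cog registers would interact. This is a toy sketch, not a statement about the hardware: the 512-long register file, the reset value of SP, and the reading that the value saved by CALL is the return address (PC + 1) are all assumptions, as are the class and method names.

```python
COG_LONGS = 512  # assumed cog register count, as on the P1


class CogModel:
    """Toy model of a dedicated stack pointer that grows down in cog RAM."""

    def __init__(self) -> None:
        self.regs = [0] * COG_LONGS
        self.pc = 0
        self.sp = 0  # hypothetical reset value; first push wraps to the top

    def call(self, addr: int) -> None:
        return_addr = self.pc + 1            # instruction after the CALL
        self.sp = (self.sp - 1) % COG_LONGS  # step 1: decrement SP
        self.regs[self.sp] = return_addr     # step 2: save return address
        self.pc = addr                       #         and jump

    def ret(self) -> None:
        self.pc = self.regs[self.sp]         # step 1: restore PC
        self.sp = (self.sp + 1) % COG_LONGS  # step 2: increment SP


if __name__ == "__main__":
    cog = CogModel()
    cog.pc = 10
    cog.call(100)
    print(cog.pc, cog.sp, cog.regs[cog.sp])
    cog.ret()
    print(cog.pc, cog.sp)
```

Note that both CALL and RET in this model touch cog RAM once in addition to the normal instruction fetch, which is the extra access whose cost is under discussion.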

cgracey wrote: »

Any extra access into cog RAM is going to take another two clocks. The only way around this is to place logic after the RAM data outputs. That would lengthen the clock period, but give us easy INDx and CALL/RET via cog RAM.

cgracey · 2014-04-15 12:18

I hate to say this, but hub execution is a huge headache. It necessitates so much complexity. I'm having a really hard time getting a handle on what can be done to cinch this in a timely manner when hub exec is involved.

From a Prop1 perspective, there are many simple things that can be done to enhance the architecture, like having nibble/byte/word operations, PTRA and PTRB, quad read and write, smart pins, a 4-level LIFO stack, pin and bit operations, a 16x16 multiplier, edge waiting, and other non-universe-transforming features.

For now, I want to proceed without hub exec. I need to get some progress underway.

Christof Eb. · 2014-04-15 12:19

Hi,
Along with more RAM, ADC and DAC, higher speed, and low power consumption, hardware multiply and perhaps divide is the most needed update for the Prop!
Best regards, Christof

Dave Hein · 2014-04-15 12:33

So without hubexec, will there still be an instruction cache that is independent of the data cache? This would need a RDLONGI to read a long from the instruction cache versus the RDLONGC that reads from the data cache. Or maybe Bill has a better idea for doing LMM without an instruction cache.

Also I'm curious about how many instructions there are now.

Kerry S · 2014-04-15 12:35

cgracey wrote: »

I hate to say this, but hub execution is a huge headache.

For now, I want to proceed without hub exec. I need to get some progress underway.

Sorry to hear that Chip. The improved instruction set will help a lot.

What made the new hub shine was hubexec. With loss of hubexec can we somehow get more cog ram instead of the huge hub that now is not nearly as effective to use? Or some simple aux-ram like setup for storing larger data sets locally so our very limited cog program space can be maximized?

Bill Henning · 2014-04-15 12:49

Chip,

Why don't you push out an fpga image without hubexec... and take a well deserved vacation with your family?

One thing I've learned from wifey is that vacations are amazingly refreshing and re-energizing. Which is why we try to go on them as often as possible

I think after a week or two of relaxation, rest and family you will come back bright eyed and bushy tailed, with more excellent ideas.

cgracey wrote: »

I hate to say this, but hub execution is a huge headache. It necessitates so much complexity. I'm having a really hard time getting a handle on what can be done to cinch this in a timely manner when hub exec is involved.

From a Prop1 perspective, there are many simple things that can be done to enhance the architecture, like having nibble/byte/word operations, PTRA and PTRB, quad read and write, smart pins, a 4-level LIFO stack, pin and bit operations, a 16x16 multiplier, edge waiting, and other non-universe-transforming features.

For now, I want to proceed without hub exec. I need to get some progress underway.

jazzed · 2014-04-15 12:52

cgracey wrote: »

I hate to say this, but hub execution is a huge headache. It necessitates so much complexity. I'm having a really hard time getting a handle on what can be done to cinch this in a timely manner when hub exec is involved.

Great.

Jettison it permanently.

cgracey · 2014-04-15 12:53

Bill Henning wrote: »

Chip,

Why don't you push out an fpga image without hubexec... and take a well deserved vacation with your family?

One thing I've learned from wifey is that vacations are amazingly refreshing and re-energizing. Which is why we try to go on them as often as possible

I think after a week or two of relaxation, rest and family you will come back bright eyed and bushy tailed, with more excellent ideas.

Yeah, I think I'm a little punch drunk. In the big picture, we need hub exec. Sorry for all this hemming and hawing.

cgracey · 2014-04-15 12:54

jazzed wrote: »

Great.

Jettison it permanently.

Are you just saying that to get things moving? I suppose you, working on C compilers, would be lamenting the absence of hub exec more than most.

jmg · 2014-04-15 12:56

cgracey wrote: »

From a Prop1 perspective, there are many simple things that can be done to enhance the architecture, like having nibble/byte/word operations, PTRA and PTRB, quad read and write, smart pins, a 4-level LIFO stack, pin and bit operations, a 16x16 multiplier, edge waiting, and other non-universe-transforming features.

For now, I want to proceed without hub exec. I need to get some progress underway.

Sounds a good idea for an interim release, there is a lot of new stuff in the first paragraph (plus things not mentioned like the common math resource), and a lot of test coverage can be done without Hubexec.

It will also allow a MHz/Area reality check, on what you have so far, and those numbers may reveal other ways to implement Hubexec.

You can reserve some opcodes for Hubexec, but have them nop for now, which also helps stabilize the Opcode map, and buys you time to think about the details of Hubexec.

ctwardell · 2014-04-15 13:02

cgracey wrote: »

Yeah, I think I'm a little punch drunk. In the big picture, we need hub exec. Sorry for all this hemming and hawing.

Do we really need it?

Sure it would be nice, but it has a cost in complexity and likely a cost in power consumption.

With RDQUAD improving the speed of LMM it may not be worth adding for this chip.

An ASIC in hand is worth way more than two in an FPGA.

C.W.

jmg · 2014-04-15 13:05

cgracey wrote: »

Yeah, I think I'm a little punch drunk. In the big picture, we need hub exec. Sorry for all this hemming and hawing.

+

cgracey wrote: »

...., working on C compilers, would be lamenting the absence of hub exec more than most.

The ability to easily manage large compiled programs is going to be very important on the final release, but there are a number of possible ways to solve that.

An interim release allows P1-like solutions to be brought up, and proven, and then some firm performance numbers will be available (including power estimations on those modes), plus you buy time to think.

Those numbers, together with the Sim checks on Chip Power Envelopes and Area, will give important reference points for how to manage 'Hubexec'.

cgracey · 2014-04-15 13:05

jmg wrote: »

Sounds a good idea for an interim release, there is a lot of new stuff in the first paragraph (plus things not mentioned like the common math resource), and a lot of test coverage can be done without Hubexec.

It will also allow a MHz/Area reality check, on what you have so far, and those numbers may reveal other ways to implement Hubexec.

You can reserve some opcodes for Hubexec, but have them nop for now, which also helps stabilize the Opcode map, and buys you time to think about the details of Hubexec.

I've been mentally combing through the issues here. I think what was getting me flustered was tying hub exec to cog RAM stacks. That's a headache! CALLA and CALLB are very simple, but slower.

This is all easy, and adequate:

CALLA/CALLB/RETA/RETB - necessary for hub exec, even just one set
CALL/RET - use 4-level LIFO stack, perfect for internal cog programs
LINK - useful for many things

I'll proceed with these. I want to have this nailed down before I sleep again. I need to get moving on the Verilog.
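
For readers unfamiliar with a small hardware return stack, the CALL/RET behaviour on a 4-level LIFO can be sketched as below. Only the depth of four comes from the post. What happens on overflow or on an extra RET is not stated in this thread, so the choices here (oldest entry lost on overflow, error on empty pop) are assumptions, and the class name is made up.

```python
class LifoStack:
    """Toy fixed-depth hardware-style stack for CALL/RET return addresses."""

    def __init__(self, depth: int = 4) -> None:
        self.depth = depth
        self.slots: list[int] = []

    def push(self, return_addr: int) -> None:
        # Assumption: when full, the oldest entry is silently lost.
        if len(self.slots) == self.depth:
            self.slots.pop(0)
        self.slots.append(return_addr)

    def pop(self) -> int:
        # Assumption: popping an empty stack is an error in this model.
        if not self.slots:
            raise IndexError("pop from empty LIFO")
        return self.slots.pop()


if __name__ == "__main__":
    stack = LifoStack()
    for addr in (1, 2, 3, 4, 5):  # five nested CALLs on a 4-level stack
        stack.push(addr)
    print([stack.pop() for _ in range(4)])
```

The point of the sketch is the depth limit: nesting deeper than four calls loses a return address, which is why this kind of stack suits small in-cog programs rather than deeply nested compiled code.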

Ken Gracey · 2014-04-15 13:06

cgracey wrote: »

For now, I want to proceed without hub exec. I need to get some progress underway.

No problem here and we support the engineering process when tough decisions have to be made. I want it for the C compiler, but I didn't know how badly I wanted hubexec until Jazzed and Bill told me. But I also know that this issue might work itself out in the meantime.

Give it some time, move on to something more productive as Bill reaffirmed. Not sure about the vacation recommendation, though. I realize everybody keeps recommending a vacation to you, but it'd probably be best just to spend a couple of days in the orchard with a chainsaw.

Ken Gracey

Rayman · 2014-04-15 13:09

Well, we'll still have LMM mode, which is just a slower hubexec mode...
LMM should still work better with the faster speed and bigger hub ram.

Wonder if David Betz has any input on what might help LMM mode under PropGCC...

I guess one upshot of all this is that the PropGCC guys should have an easy time getting it working on this chip.

Bill Henning · 2014-04-15 13:14

Works great

Losing hubexec would hurt - a lot.

cgracey wrote: »

I've been mentally combing through the issues here. I think what was getting me flustered was tying hub exec to cog RAM stacks. That's a headache! CALLA and CALLB are very simple, but slower.

This is all easy, and adequate:

CALLA/CALLB/RETA/RETB - necessary for hub exec, even just one set
CALL/RET - use 4-level LIFO stack, perfect for internal cog programs
LINK - useful for many things

I'll proceed with these. I want to have this nailed down before I sleep again. I need to get moving on the Verilog.
