The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

cgracey · 2014-04-14 13:18

David Betz wrote: »

Chip,

At one point you, or maybe it was Ken, suggested that you might make the RTL for P1 available after P2 shipped. Now it seems that the RTL for P1+ is going to be an extension of the RTL for P1. Do you still plan to release any RTL either before or after you ship the next chip? Did you by any chance archive the RTL for P1 before you started morphing it into P1+ or P2 or whatever the chip being described in this thread will be called?

Thanks,
David

We plan to release Prop1 code, at first.

David Betz · 2014-04-14 13:38

cgracey wrote: »

We plan to release Prop1 code, at first.

Great! I'm glad that's still part of the plan.

Sapieha · 2014-04-14 13:42

Hi Chip.

Good to hear.
That will be good for my serial-comms experiments.

cgracey wrote: »

We plan to release Prop1 code, at first.

cgracey · 2014-04-14 13:51

I made a mistake in thinking about and explaining the indirect issues. We CAN do an extra clock to achieve indirection, without any clock speed penalty. We would let the first clock go through, reading D and S according to the D and S fields in the instruction. Then, on the next clock, we could issue the reads for indirect D and S using some implied hidden registers. This would help speed and code density, but would give us only maybe two indirect registers. I kind of like the universal approach using two instructions, but it is wasteful, compared to what could be done with implied indirection via special INDA/INDB register locations.

Bill Henning · 2014-04-14 14:08

EXCELLENT NEWS!!!

I knew you would figure out a way.

Ummm... could we have our cake, and eat it too?

Adding the universal ALT instruction gives us MANY additional pointers, and we can use INDA/INDB for the highest speed uses.

cgracey wrote: »

I made a mistake in thinking about and explaining the indirect issues. We CAN do an extra clock to achieve indirection, without any clock speed penalty. We would let the first clock go through, reading D and S according to the D and S fields in the instruction. Then, on the next clock, we could issue the reads for indirect D and S using some implied hidden registers. This would help speed and code density, but would give us only maybe two indirect registers. I kind of like the universal approach using two instructions, but it is wasteful, compared to what could be done with implied indirection via special INDA/INDB register locations.

tonyp12 · 2014-04-14 14:48

> but it is wasteful, compared to what could be done with implied indirection via special INDA/INDB register
But does INDA/INDB have optional post-inc/pre-dec flags? Any space saved is lost if you have to put a SUB before or an ADD after the INDx instruction.

jmg · 2014-04-14 15:14

cgracey wrote: »
Regarding REPS and the code-size matter, remember that there's the $ for origin:
	REPS	#count,#:end-$
	inst
	inst
:end	inst

The problem with this simpler, vanilla form is that the code fails if the last inst opcode is a double-size one.
Is there still a delay following REPS before the looping block, or has that gone?

If REPS now starts immediately, then a single-label form is OK; a finite delay needs two labels.

	REPS	#count,EndLoop
	inst
	inst
	inst  ' can be one or two sized inst
:EndLoop

addit :

Bill Henning wrote: »

Or:

	REPS	#count    ' assembler hides end-$ computation
	inst
	inst
	inst
	ENDR

yes, that also works well, if there is no lead-in delay on REPS
I think REPS cannot be nested, so this form is fine, and if anyone does REPS..REPS ENDR ENDR it can spit an error
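The REPS ... ENDR form discussed above can be sketched as a tiny assembler pass. This is a minimal Python model, not PASM and not the actual Parallax assembler: `expand_reps` and `size_of` are hypothetical names, and emitting the block length in longs as the second REPS operand is an assumption (the real operand encoding is not settled in this thread). It shows why the ENDR form copes with a double-size last instruction and how nesting can be rejected.

```python
def expand_reps(lines, size_of=lambda inst: 1):
    """Resolve REPS ... ENDR blocks in a list of source lines.

    size_of(inst) returns the instruction size in longs (1 or 2), so a
    double-size last instruction is counted correctly.
    """
    out = []
    open_at = None   # index in `out` of the REPS line that is still open
    block = 0        # longs accumulated inside the open block
    for text in lines:
        inst = text.strip()
        op = inst.split()[0].upper() if inst else ""
        if op == "REPS":
            if open_at is not None:
                raise SyntaxError("REPS cannot be nested")
            open_at, block = len(out), 0
            out.append(inst)
        elif op == "ENDR":
            if open_at is None:
                raise SyntaxError("ENDR without REPS")
            out[open_at] += f",#{block}"   # assumed operand: block length in longs
            open_at = None
        elif inst:
            if open_at is not None:
                block += size_of(inst)
            out.append(inst)
    if open_at is not None:
        raise SyntaxError("REPS without ENDR")
    return out
```

Because the pass only needs the position of ENDR, no end label is required, which matches the observation that this form only works when REPS has no lead-in delay.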

jmg · 2014-04-14 15:25

Bill Henning wrote: »

EXCELLENT NEWS!!!

I knew you would figure out a way.

Ummm... could we have our cake, and eat it too?

Adding the universal ALT instruction gives us MANY additional pointers, and we can use INDA/INDB for the highest speed uses.

I was wondering the same thing.

I'm unclear if Chip is meaning an extra Opcode clock (4 SysClk Opcode) or an extra SysClk (3 CycClk opcode) in #645 ?

Bill Henning · 2014-04-14 15:37

I think 3 cycclk

fyi,

I think everything should be documented in sysclocks, I've been confused a number of times, which suggests to me that those new to the prop will generally go HUH??? at the two clocks.

jmg wrote: »

I was wondering the same thing.

I'm unclear if Chip is meaning an extra Opcode clock (4 SysClk Opcode) or an extra SysClk (3 CycClk opcode) in #645 ?

jmg · 2014-04-14 15:52

Bill Henning wrote: »

I think everything should be documented in sysclocks, I've been confused a number of times, which suggests to me that those new to the prop will generally go HUH??? at the two clocks.

Yes, and if there are going to be 3 SysClk opcodes, that gives little choice, as you cannot really spec 1.5 OpCodeClks ?
That would make mnemonics 2/3/4 SysClks in speed.

potatohead · 2014-04-14 15:54

I agree with this. It is confusing. Sysclocks would be unambiguous.

Is there a delay now on REP? I thought I read we don't have one sans the complex pipeline. If that is true, I like the form Bill posted last the best. (unable to quote on this device in any sane way)

cgracey · 2014-04-14 16:53

potatohead wrote: »

I agree with this. It is confusing. Sysclocks would be unambiguous.

Is there a delay now on REP? I thought I read we don't have one sans the complex pipeline. If that is true, I like the form Bill posted last the best. (unable to quote on this device in any sane way)

REP has no delay slots in this design, since there's no pipeline.

The way we can get INDA/INDB to have all the different pre/post-inc/dec possibilities is to make 8 registers worth of INDx registers:

$1F8 = INDA
$1F9 = INDA++
$1FA = INDA--
$1FB = ++INDA
$1FC = INDB
$1FD = INDB++
$1FE = INDB--
$1FF = ++INDB

MOV INDB++,INDA++ ...same as... MOV $1FD,$1F9
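A minimal Python sketch of the register map above, as a behavioral model only (not PASM and not the actual silicon design). The class name `Cog`, the 9-bit pointer width, the wrap-around at 512, and resolving S before D are all assumptions made for illustration. It uses the table's addresses, so INDB++ is $1FD and INDA++ is $1F9.

```python
COG_SIZE = 512            # longs of cog RAM, addressed by the 9-bit D and S fields
MASK9 = COG_SIZE - 1

# Alias address -> (pointer, pre-adjust, post-adjust), taken from the table above.
INDX_MAP = {
    0x1F8: ("A", 0, 0),   # INDA
    0x1F9: ("A", 0, +1),  # INDA++
    0x1FA: ("A", 0, -1),  # INDA--
    0x1FB: ("A", +1, 0),  # ++INDA
    0x1FC: ("B", 0, 0),   # INDB
    0x1FD: ("B", 0, +1),  # INDB++
    0x1FE: ("B", 0, -1),  # INDB--
    0x1FF: ("B", +1, 0),  # ++INDB
}


class Cog:
    def __init__(self):
        self.ram = [0] * COG_SIZE
        self.ptr = {"A": 0, "B": 0}   # hidden pointer registers (assumed 9 bits)

    def resolve(self, field):
        """Turn a 9-bit D or S field into the cog address actually accessed."""
        if field not in INDX_MAP:
            return field              # ordinary direct register
        name, pre, post = INDX_MAP[field]
        self.ptr[name] = (self.ptr[name] + pre) & MASK9
        addr = self.ptr[name]
        self.ptr[name] = (self.ptr[name] + post) & MASK9
        return addr

    def mov(self, d_field, s_field):
        """MOV D,S with both fields passed through the INDx decode."""
        src = self.resolve(s_field)   # assumed order: S first, then D
        dst = self.resolve(d_field)
        self.ram[dst] = self.ram[src]
```

With this model, repeating `cog.mov(0x1FD, 0x1F9)` copies a block of longs one per instruction while both pointers advance, which is the code-density gain being discussed.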

Bill Henning · 2014-04-14 17:08

How about SETINDMOD #bbbaaa

aaa

0xx = use INDA value directly
100 = INDA++
101 = INDA--
110 = ++INDA
111 = --INDA

bbb

0xx = use INDB value directly
100 = INDB++
101 = INDB--
110 = ++INDB
111 = --INDB

That way there is no need to have so many registers...

cgracey wrote: »

REP has no delay slots in this design, since there's no pipeline.

The way we can get INDA/INDB to have all the different pre/post-inc/dec possibilities is to make 8 registers worth of INDx registers:

$1F8 = INDA
$1F9 = INDA++
$1FA = INDA--
$1FB = ++INDA
$1FC = INDB
$1FD = INDB++
$1FE = INDB--
$1FF = ++INDB

MOV INDB++,INDA++ ...same as... MOV $1FD,$1F9

cgracey · 2014-04-14 20:17

Bill Henning wrote: »

How about SETINDMOD #bbbaaa

aaa

0xx = use INDA value directly
100 = INDA++
101 = INDA--
110 = ++INDA
111 = --INDA

bbb

0xx = use INDB value directly
100 = INDB++
101 = INDB--
110 = ++INDB
111 = --INDB

That way there is no need to have so many registers...

But we need to be able to do stuff like JMPSW INDA,++INDA.

Hey, I just realized that we had cog RAM stacks in the Prop2, all along:

JMPSW INDA--,ADR = CALL ADR
JMP ++INDA = RET

This takes fewer transistors than a hardware LIFO.
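The CALL/RET equivalence above can be modeled in a few lines of Python. This is a sketch under stated assumptions, not the chip's behavior: it assumes JMPSW D,S stores the address of the next instruction in D and jumps to S, that the stack grows downward, and that INDA starts at a hypothetical stack top of $1EF. The class name `StackCog` is made up for illustration.

```python
class StackCog:
    def __init__(self, stack_top=0x1EF):
        self.ram = [0] * 512
        self.inda = stack_top   # hidden INDA pointer (assumed initial value)
        self.pc = 0

    def call(self, target):
        """JMPSW INDA--,target: store return address at [INDA], post-decrement, jump."""
        self.ram[self.inda] = self.pc + 1
        self.inda = (self.inda - 1) & 0x1FF
        self.pc = target

    def ret(self):
        """JMP ++INDA: pre-increment INDA, then jump to the address stored there."""
        self.inda = (self.inda + 1) & 0x1FF
        self.pc = self.ram[self.inda] & 0x1FF
```

Post-decrement on push paired with pre-increment on pop keeps INDA pointing at the next free slot, so nested calls unwind in the right order without a separate hardware LIFO.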

jmg · 2014-04-14 20:29

Bill Henning wrote: »

That way there is no need to have so many registers...

Is there a spare opcode bit, that can re-map these to give the decode choices, but not to consume (valuable) register address space with fixed-use-locations ?

Bill Henning · 2014-04-14 20:29

True!

I think you can safely deep-six the four level stack.

Good point about task switching. So how about

How about SETINDMOD #ddd,#sss

ddd - applies to whichever index register is used as the destination

0xx = use INDd value directly, d=A/B
100 = INDd++
101 = INDd--
110 = ++INDd
111 = --INDd

sss - applies to whichever index register is used as the source

0xx = use INDs value directly, s=A/B
100 = INDs++
101 = INDs--
110 = ++INDs
111 = --INDs

That way there is no need to have so many registers... and the INDMOD stays in effect until next SETINDMOD

cgracey wrote: »

But we need to be able to do stuff like JMPSW INDA,++INDA.

Hey, I just realized that we had cog RAM stacks in the Prop2, all along:

JMPSW INDA--,ADR = CALL ADR
JMP ++INDA = RET

This takes fewer transistors than a hardware LIFO.

cgracey · 2014-04-14 21:08

jmg wrote: »

Is there a spare opcode bit, that can re-map these to give the decode choices, but not to consume (valuable) register address space with fixed-use-locations ?

The only way we could get a spare bit, or two, would be to reduce cog RAM to 256 longs. Then, we could get indirect bits for both D and S. And cog size would be cut in half. It might be viable with hub exec, you know. 32 cogs!

Bill Henning · 2014-04-14 21:12

Umm.. noooooooooooooooooooooooooooooo

With 512 cog locations (minus shadow regs) it is still barely possible to do a 256 entry lookup table for vm's using cog memory, without taking a hub cycle hit.

16 cogs with 512 registers, and tasks, is FAR more useful than 32 cogs with 256 registers.

cgracey wrote: »

The only way we could get a spare bit, or two, would be to reduce cog RAM to 256 longs. Then, we could get indirect bits for both D and S. And cog size would be cut in half. It might be viable with hub exec, you know. 32 cogs!

cgracey · 2014-04-14 21:14

Bill Henning wrote: »

True!

I think you can safely deep-six the four level stack.

Good point about task switching. So how about

How about SETINDMOD #ddd,#sss

ddd - applies to whichever index register is used as the destination

0xx = use INDd value directly, d=A/B
100 = INDd++
101 = INDd--
110 = ++INDd
111 = --INDd

sss - applies to whichever index register is used as the source

0xx = use INDs value directly, s=A/B
100 = INDs++
101 = INDs--
110 = ++INDs
111 = --INDs

That way there is no need to have so many registers... and the INDMOD stays in effect until next SETINDMOD

Good idea on the mode bits. If we gave two registers per INDx, we could pick between two of the preset modes. That would enable stacks.

Bill Henning · 2014-04-14 21:24

Sounds good!

cgracey wrote: »

Good idea on the mode bits. If we gave two registers per INDx, we could pick between two of the preset modes. That would enable stacks.

jmg · 2014-04-14 21:30

cgracey wrote: »

The only way we could get a spare bit, or two, would be to reduce cog RAM to 256 longs. Then, we could get indirect bits for both D and S. And cog size would be cut in half. It might be viable with hub exec, you know. 32 cogs!

That sounds too costly, I was thinking more along the lines of 'blurring' the 9b + 9b opcode fields to something like 10b+ 8b in some sparse cases, where that split allows things like dedicated pointers/sfr (special function registers) 'above' usual 512 limit, and now the #Immediate value is smaller, & the source register form is limited to one of the lower 256.
-but it has avoided eating-into general purpose RAM.

Ariba · 2014-04-14 21:31

cgracey wrote: »

But we need to be able to do stuff like JMPSW INDA,++INDA.
....

If this is for task switching, do we then not also need the automatic wrapping inside a table (FIXINDx instructions) ?
I have the feeling this needs a lot of logic gates (comparators, muxes, registers) ?

I think we can always do it with hardcoded task switches instead of a task-list. For example with 3 tasks:

   jmpsw task1,task2
   ...
   jmpsw task2,task3
   ...
   jmpsw task3,task1
   ...

Andy

cgracey · 2014-04-14 21:43

Ariba wrote: »
If this is for task switching, do we then not also need the automatic wrapping inside a table (FIXINDx instructions) ?
I have the feeling this needs a lot of logic gates (comparators, muxes, registers) ?

I think we can always do it with hardcoded task switches instead of a task-list. For example with 3 tasks:
   jmpsw task1,task2
   ...
   jmpsw task2,task3
   ...
   jmpsw task3,task1
   ...
Andy

I'd like to avoid that wrapping issue, as it is costly. I hope to have all these issues settled tonight. There are many other things that need attention.

cgracey · 2014-04-14 21:46

jmg wrote: »

That sounds too costly, I was thinking more along the lines of 'blurring' the 9b + 9b opcode fields to something like 10b+ 8b in some sparse cases, where that split allows things like dedicated pointers/sfr (special function registers) 'above' usual 512 limit, and now the #Immediate value is smaller, & the source register form is limited to one of the lower 256.
-but it has avoided eating-into general purpose RAM.

I could see that, but how would you know if you were in a 10b + 8b situation?

Ariba · 2014-04-14 22:01

cgracey wrote: »

I'd like to avoid that wrapping issue, as it is costly. I hope to have all these issues settled tonight. There are many other things that need attention.

One of these other things:

I think we need a word-move instruction for SDRAM drivers. We have Nibble and Byte instructions (ROLNIB, GETNIB, SETNIB etc.) but nothing like that with word size.
If we want to access 16bit wide SDRAM with half the sysclock rate we need to do the move and shift in one single cycle instruction.

My proposal is a MOVWORD instruction which moves the lower or higher 16bits from S to the lower or higher 16bits of D. The other 16bits in D must not be affected:

movword D,S,#%ds

   movword D,S,#%00   ' D.word0 <- S.word0
   movword D,S,#%01   ' D.word0 <- S.word1
   movword D,S,#%10   ' D.word1 <- S.word0
   movword D,S,#%11   ' D.word1 <- S.word1
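The proposed MOVWORD table can be checked with a small Python model. This is only a behavioral sketch of the proposal as written above (the instruction does not exist in any released chip); the function name `movword` and the 32-bit masking are illustrative. In the `#%ds` selector, the high bit picks the destination word and the low bit picks the source word.

```python
def movword(d, s, sel):
    """Return the new 32-bit D after the proposed MOVWORD D,S,#%ds."""
    s_hi = sel & 1           # low selector bit: which word of S to read
    d_hi = (sel >> 1) & 1    # high selector bit: which word of D to replace
    word = (s >> (16 * s_hi)) & 0xFFFF
    shift = 16 * d_hi
    kept = d & ~(0xFFFF << shift)   # the other 16 bits of D are not affected
    return (kept | (word << shift)) & 0xFFFFFFFF
```

For example, `movword(0xAAAABBBB, 0x11112222, 0b01)` replaces the low word of D with the high word of S and leaves the high word of D alone.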

Andy

jmg · 2014-04-14 22:21

Ariba wrote: »
One of these other things:

I think we need a word-move instruction for SDRAM drivers. We have Nibble and Byte instructions (ROLNIB, GETNIB, SETNIB etc.) but nothing like that with word size.
If we want to access 16bit wide SDRAM with half the sysclock rate we need to do the move and shift in one single cycle instruction.

My proposal is a MOVWORD instruction which moves the lower or higher 16bits from S to the lower or higher 16bits of D. The other 16bits in D must not be affected:
movword D,S,#%ds

   movword D,S,#%00   ' D.word0 <- S.word0
   movword D,S,#%01   ' D.word0 <- S.word1
   movword D,S,#%10   ' D.word1 <- S.word0
   movword D,S,#%11   ' D.word1 <- S.word1
Andy

In common with SDRAM which would want 16 bit and a CS# strobe, LCD parallel interfaces are similar.
Some need 24b, which may be best via the video , others need 16b i8080 bus models.

A useful opcode here could be a double-move, that does 2 x16b moves on a 32 bit register.
With 2 SysClks available, it may be possible to get close to 200MHz bursts ?

cgracey · 2014-04-14 22:31

Ariba wrote: »
One of these other things:

I think we need a word-move instruction for SDRAM drivers. We have Nibble and Byte instructions (ROLNIB, GETNIB, SETNIB etc.) but nothing like that with word size.
If we want to access 16bit wide SDRAM with half the sysclock rate we need to do the move and shift in one single cycle instruction.

My proposal is a MOVWORD instruction which moves the lower or higher 16bits from S to the lower or higher 16bits of D. The other 16bits in D must not be affected:
movword D,S,#%ds

   movword D,S,#%00   ' D.word0 <- S.word0
   movword D,S,#%01   ' D.word0 <- S.word1
   movword D,S,#%10   ' D.word1 <- S.word0
   movword D,S,#%11   ' D.word1 <- S.word1
Andy

Thanks for pointing this out. Perhaps GETNIB/BYTE/WORD should perform a ROL function, too.

Cluso99 · 2014-04-14 23:21

re INDA/INDB

Why do we need more than one mode?

When used as a stack, you are either PUSHing (CALLing) or POPing (RETing). So you choose PUSH INDA++, which pushes first and then increments, and POP --INDA, which decrements first and then pops. This only requires 2 register spaces for INDA. For the occasional times you just need a copy of the top of the stack, you have to POP then PUSH.

If we want 2 stacks, the probability is we will go from each end. Therefore make INDB work the opposite - PUSH --INDB and POP INDB++, and 2 more register spaces.

Now, you will notice that INDA & INDB both have pre-decrement and post-increment. They just get reversed for CALL to RET and also for PUSH to POP. Silicon should be simplified here.

This also gives us the freedom to use INDx++ and --INDx in standard op codes too.

Might INDA & INDB be better called STACKA and STACKB ?

BTW I agree, the 4 level LIFO (which is only 17 address bits + flags) can go.

Cluso99 · 2014-04-14 23:37

NIB/BYTE/WORD

May I suggest...

MOVNIB D,S/#,#0..7
MOVBYTE D,S/#,#0..3
MOVWORD D,S/#,#0..1
where the NIB/BYTE/WORD (rightmost bits) of S/# replace the bits in D as indexed by #0..n
and
GETNIB D,S,#0..7
GETBYTE D,S,#0..3
GETWORD D,S,#0..1
where D is left-zero filled and #0..n is an index into S.
These 6 instructions could map nicely to one set of opcode sets.
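The six suggested instructions share one pattern, which a short Python model makes visible. This is a sketch of the suggestion as described above, not an existing instruction set: `mov_field`, `get_field` and the `WIDTH` table are hypothetical names, and index 0 is assumed to be the least significant field.

```python
WIDTH = {"NIB": 4, "BYTE": 8, "WORD": 16}   # field sizes in bits


def _field(kind, index):
    bits = WIDTH[kind]
    if not 0 <= index < 32 // bits:
        raise ValueError("index out of range for " + kind)
    return (1 << bits) - 1, bits * index    # (mask, shift)


def mov_field(kind, d, s, index):
    """MOVNIB/MOVBYTE/MOVWORD: rightmost bits of S replace field `index` of D."""
    mask, shift = _field(kind, index)
    return (d & ~(mask << shift) & 0xFFFFFFFF) | ((s & mask) << shift)


def get_field(kind, s, index):
    """GETNIB/GETBYTE/GETWORD: field `index` of S, zero-filled on the left."""
    mask, shift = _field(kind, index)
    return (s >> shift) & mask
```

Only the field width and index range differ between the variants, which is consistent with the idea that they could share one group of opcodes.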

Can you remind me what ROLNIB/BYTE/WORD does?

My preference when performing a MOVe to cog memory is to use "MOVxxx". Similarly I prefer MOVD, MOVS, MOVI or MOVINST, and MOVCOND.
My preference to use "SETxxx" is for buried registers or setting modes.

Cluso99 · 2014-04-15 00:17

RD CACHE

Thinking about what has been discussed above, I wonder if an alternative could be...

RDQUADC D/#,S/PTRA++/PTRB++ 'reads a quad long into cog ram at a quad boundary (no need for buried DCACHE) and resets the internal OFFSET 4-bit counter.

RDBYTEC D,S/# WC 'reads a byte from the cog where S/# specifies the location D/# used in the RDQUADC, and increments the OFFSET counter +1. "C" set if OFFSET wraps = last byte.

RDWORDC D,S/# WC 'reads a word from the cog where S/# specifies the location D/# used in the RDQUADC, and increments the OFFSET counter +2. "C" set if OFFSET wraps = last word.

Note: mixed RDBYTEC and RDWORDC is not supported.
Note if the user gives a different S/# in RDBYTEC or RDWORDC than the D/# used in the RDQUADC then results will be unpredictable (reads from the cog location specified)
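The RDQUADC / RDBYTEC / RDWORDC idea above can be sketched as a Python model. Everything here is an assumption layered on the description: the class name `QuadReader` is hypothetical, the hub read is simplified to passing in four longs, and bytes are assumed to be taken least significant first. It is a model of the proposal, not of any implemented instruction.

```python
class QuadReader:
    def __init__(self):
        self.cog = [0] * 512
        self.offset = 0          # internal 4-bit OFFSET counter

    def rdquadc(self, d, quad):
        """Place four longs at cog address d (quad-aligned) and reset OFFSET."""
        if d % 4 or len(quad) != 4:
            raise ValueError("need a quad-aligned address and four longs")
        self.cog[d:d + 4] = [x & 0xFFFFFFFF for x in quad]
        self.offset = 0

    def rdbytec(self, s):
        """Return (byte, C); C is True when OFFSET wraps, i.e. on the last byte."""
        long_ = self.cog[s + (self.offset >> 2)]
        value = (long_ >> (8 * (self.offset & 3))) & 0xFF
        self.offset = (self.offset + 1) & 0xF
        return value, self.offset == 0

    def rdwordc(self, s):
        """Return (word, C); C is True when OFFSET wraps, i.e. on the last word."""
        long_ = self.cog[s + (self.offset >> 2)]
        value = (long_ >> (8 * (self.offset & 2))) & 0xFFFF
        self.offset = (self.offset + 2) & 0xF
        return value, self.offset == 0
```

A loop can call `rdbytec` until C comes back set and then issue the next `rdquadc`, which is the usage pattern the WC flag suggests.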
