Chip, regarding "RDWORD/WRWORD": I agree we could do without them at a pinch - but they would have to be simulated in software, and each 16-bit Hub access would then take a couple of extra instructions.
From a high-level language this is no problem at all - but my gut tells me it is not a good idea for a microcontroller that is still heavily oriented towards embedded applications.
People who have to develop fast and compact embedded code would never use 32 bits where 16 would do - but now, every time they use a 16-bit value, they will have to mentally juggle its increased access time against the increased code size.
They could easily end up hating this new chip every time they have to make that trade-off.
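To make that concrete, here's a minimal (untested!) sketch of what a simulated RDWORD might look like, assuming P1-style PASM where RDLONG ignores the low two address bits, and with val, addr, and mask_ffff as assumed registers:

        rdlong  val, addr           ' fetch the long containing the word
        test    addr, #2       wz   ' which word of the long do we want?
  if_nz shr     val, #16            ' odd word: move the upper half down
        and     val, mask_ffff      ' keep only the low 16 bits

mask_ffff long  $FFFF

A simulated WRWORD would be worse still - a read-modify-write of the containing long.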
Ross.
EDIT: I see Chip already responded to this issue at post #22.
If you can run Perl, you're welcome to use the program I used to analyze my files (attached). The first thing it does is to strip all comments, since things like "and" and "or" tend to be rife there. Then it looks only in DAT sections for opcodes. I also found that DAT-resident strings included opcode mimics, so I included the long, word, and byte pseudo-ops to short-circuit any further search in the line. It also handles Unicode files.
BTW, the last thing it prints is the number of waitpxx ops with immediate operands vs. the total number.
-Phil
Does anyone have any data on how other chips apportion VDD pins based on their core current? I know on very high-current chips, they place power pads right in the middle of the die and attach the package directly to them.
This is pasted from a 100-pin Atmel CPLD datasheet (333MHz internal spec):
ATF1508RE has up to 80 bi-directional I/O pins, four dedicated input pins, 1 internal voltage regulator supply input pin (VCCIRI), 6 I/O VCC pins (VCCIOA and VCCIOB), 8 ground pins (GND), and 1 internal voltage regulator output pin (VCCIRO).
- note that part specs ~30mA @ 100MHz on VccCore, so the VccCore pin count here is going to be lower than you need.
I'm really liking the sound of this! I agree that code compatibility isn't necessary. Some additional questions:
1) What is the expected clock speed?
2) Will all cogs still be identical?
3) Will it use the 2-clock or 4-clock design?
4) Will it keep the ROM lookup tables or CORDIC?
5) Will it keep the old or new bootstrap?
6) Will it have the new monitor?
7) Will it have the debug/trace?
8) Will the CLUT become AUX?
9) Will the HUB access be every 32 clock cycles?
10) Will there be an equivalent to PORT_D?
11) Will there be 16 HUB locks?
And a few thoughts on the above questions:
For PORT_D, if it would be easy to add a hardwired bus between pairs of cogs (0 and 1, 2 and 3, etc.), this might make it easier to write efficient protocols that require two cogs. Additionally, it might be possible to add software support for an 8-bit (assuming the limited number of I/O pins) external RAM driver that is controllable via PORT_D from the "main" program. In other words, the driver would run in COG 1 and the main program would run in COG 0, commanding it over the hardwired port. The driver would still most likely transfer between external and HUB RAM, which would allow for larger memory models in the "main" program. If I had a choice in the bus architecture, I'd say two 32-bit registers that are cross-coupled such that the first is write-only and the second is read-only (i.e. no need for DIRx).
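To sketch how the pairing might look from software (the OUTD/IND register names here are purely hypothetical, standing in for the proposed write-only/read-only pair):

        ' cog 0: post a command and spin until the driver in cog 1 replies
        mov     outd, request       ' write-only side: partner cog sees this
:wait   mov     t1, ind             ' read-only side: partner's reply register
        cmp     t1, #0         wz
  if_z  jmp     #:wait              ' wait for a nonzero acknowledge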
If there's not enough room for CORDIC in each cog, could you instead make a single instance available in the HUB? Since you will have a 128-bit data bus, it should be possible to start a CORDIC calculation with a single HUBOP (pointing to a block of registers) and read the results on the following hub slot. With this approach, there is obviously the potential for resource conflict. The simplest solution is to leave it up to the programmer to avoid accessing CORDIC from two cogs at the same time.
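Usage could then be as simple as this (pure speculation - the operand encoding and the CORDIC_xxx names are made up for illustration):

        hubop   block, #CORDIC_START    ' point the hub CORDIC at our register block
        ' ...a few free instructions while it grinds...
        hubop   block, #CORDIC_READ     ' collect results on our next hub slot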
1) 200MHz
2) yes
3) 2-clock
4) CORDIC
5) new, with authentication
6) yes
7) no, maybe?
8) no CLUT, WAITVID can convey 128 bits now
9) every 16 clocks / 8 instructions
10) Only A and B, for now
11) yes
CORDIC, MUL32X32, DIV64/32, SQRT (maybe) will be in the hub, but pipelined, so nobody has to wait for anybody else, only their turn at the hub.
Cogs could offset each other by 1 clock via WAITCLK to tag-team on the pins.
I don't know if cog-to-cog 32-bit links will be practical.
Also, I'm assuming that the following P2 features will not be ported:
SERDES
INDx
tasks
register remapping
If this were the case, would it be possible to add an extremely simple cooperative multitasking instruction set? I'm thinking something along the following lines:
Single internal TASK register for holding a PC/Z/C.
GETTASK instruction to read TASK.
SETTASK instruction to write TASK.
SWTASK instruction that takes PC+1/Z/C and swaps it with whatever is in TASK.
With just SETTASK and SWTASK, it would be possible to write drivers with "concurrent" read/write threads. With GETTASK, more complex schedulers could be developed. No, it's not as efficient as interleaved tasking, but it should add very little complexity and circuitry for a significant increase in usability over the current P1 approach(es).
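For illustration, a two-thread driver using the proposed instructions might look like this (everything here is hypothetical, including how the initial PC/Z/C image in tx_start would be formed):

        settask tx_start            ' prime TASK with the TX thread's entry point

rx_loop ' ...sample inbound bits...
        swtask                      ' yield: TX resumes where it last yielded
        jmp     #rx_loop

tx_loop ' ...drive outbound bits...
        swtask                      ' yield: RX resumes
        jmp     #tx_loop

tx_start long   tx_loop             ' hypothetical initial TASK image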
Right, no serdes, tasks, register remapping - though there may be some INDx. Tasks in a non-pipelined architecture are almost trivial to implement, but I don't want to go there yet. Having multiple tasks IS a lot of fun and makes some apps possible to do in one cog.
Nope, all the nice palette stuff is gone. I will really miss the 4bpp and palette modes. Unfortunately, those need AUX (formerly CLUT) and the new video engine... many gates.
On Morpheus, I use RRRGGGBB, works very well. A little external logic would allow RRGGBBII.
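For reference, packing 8-bit R/G/B down to RRRGGGBB is just shifts and masks - a minimal sketch (r, g, b, t1, and pix are assumed registers):

        mov     pix, r
        and     pix, #%1110_0000    ' top 3 red bits stay in bits 7..5
        mov     t1, g
        shr     t1, #3              ' top 3 green bits into bits 4..2
        and     t1, #%0001_1100
        or      pix, t1
        mov     t1, b
        shr     t1, #6              ' top 2 blue bits into bits 1..0
        or      pix, t1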
- INDx would be EXTREMELY useful, especially if it has the same modes as P2 (so it can be used as a stack, or FIFO). Minimum 2 please; four would be incredible.
- Personally, I'd prefer 32 cogs, with my hub slot mapping table. Failing that, task capability would be great
(putting on kevlar suit)
- I don't dare ask for the PTRA/PTRB support... even with the kevlar suit on ... although it is great for compiler support
Just chiming in to say I like the idea of going ahead with this chip. P2 was getting out of hand and this looks much more do-able and still a quantum jump over what we have now. Will be following more closely than I was following the P2 thread lately...
Reference devices like the SSD1963 support 16/18/24-bit pixel modes, and slave bus widths of 8, 9, 12, 16, 18, and 24 bits.
Where they have < 24 bpp, I think they left-justify in a 24-bit output field.
Exercising restraint can be so difficult and I encourage all of us to set limits now and design towards them. This morning we had only 16 cogs.
I don't think the 32 was serious, just a for-example point.
We still need an OnSemi Power/Speed Simulation pass on this, before even 16 COGS & 200MHz are actually confirmed as within the Power/Process envelopes.
32 COGs is unlikely to fit the power envelope, and even 16 still needs Sim confirmation.
Unused COGs burn quite a lot of die area, so I think simple tasking should be checked, once an OnSemi Sim confirms how many COGs can 'stay cool' inside the package.
Notes: ClusoDebugger_276.spin contains every instruction, so you can effectively subtract 1 from all of these. I didn't exclude comments, so "and" and "or" are artificially high; a few other things are also higher because of Spin keywords.
Here's the number of hits total for each instruction:
Simple tasking would mean no need for 32 cogs. Works for me. I know, feature creep.
So I guess we could do 8-bit fullscreen WVGA using a 384kB pixel buffer...
Guess we'd have one cog half full with a 256-long CLUT, and the rest of it would just push the pixel buffer out the DAC...
Ok... So then we would have 36 I/O available after SDRAM. With 4 used for VGA we are left with 32. Same as what I have now to work with (P1). I would have to give up my 4 hard inputs (direct to Prop) to get the Mouse and Keyboard serial ports that I am now getting from the grafted Raspberry Pi. Not optimal, but doable.
As for memory, if you are planning on this to have SDRAM typically, can you not (don't hang me) make the I/O pins for that interface just digital and drop the analog from them? That would free up area for more memory for the LCD/VGA guys. If you don't need the SDRAM they would still be available for regular digital I/O applications. Would there really be a practical use for 80 analog pins on one chip?
Even with the extra I/O, 16 cogs is fine. Please don't give Ken a stroke! He has been very good with our insanity up till now, and we need him at 100%, working his marketing magic.
Would it be possible to set the number of cogs that can access the hub memory, thereby increasing the bandwidth to hub memory?
I.e., have a setting so all 16 cogs access memory, one to set it to 8 cogs, and so on all the way down to one cog with full access, increasing bandwidth with each setting. That way hub exec could run at varying rates... the best of both worlds.
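For scale, assuming the 200MHz / hub-every-16-clocks figures above:
16 cogs: 200MHz/16 = 12.5M hub slots/s per cog
8 cogs: 200MHz/8 = 25M hub slots/s per cog
1 cog: 200MHz/1 = 200M hub slots/s (if the hub RAM could actually cycle that fast)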
Maybe this is a case for the simple tasking? Only instead of 2 programs slicing, one is the code and the other is the video gen, using the COG memory instead of its own local memory.
That preserves RAM, which is the costly die item here.
That's why I suggested the minimal cooperative approach. It requires zero modification of the pipeline or instruction processing, while enabling what I imagine to be the biggest use case: a single cog with separate I/O read and write threads. With 16 cogs, I see much less need for the 4-task approach in P2. This is a KISS solution that should have the least impact on the new chip.
By the way, what nickname are we giving this thing?
Try (untested!)
electrodude
How much RAM does that ~70% give, and what growth in die-edge would be needed to make it to 800x480 LCD numbers?
800x480 @ 8bpp = 384K
If you don't need a back buffer for page flipping, it will work.
Hmm - 8bpp? Does this design include a 256-entry palette RAM?
I.e., how does that 8bpp map onto the DACs Chip has mentioned?
That's 512KB. Each 128KB takes 5.7 square mm.
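(So the full 512KB is 4 x 5.7 = 22.8 square mm of die.)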
How much RAM is needed for the LCDs?
You guys want a wet blanket? I'm here. We might be starting again, but we're in the home stretch, so let's finish this one!
Ken Gracey
I thought PTRx was necessary in order to access larger HUB memory spaces.
In purely raw pixel storage, it is like this - of course, some will map to DACs easier than others.
8*800*480/8 = 384000
9*800*480/8 = 432000
10*800*480/8 = 480000
11*800*480/8 = 528000
12*800*480/8 = 576000
13*800*480/8 = 624000
14*800*480/8 = 672000
15*800*480/8 = 720000
16*800*480/8 = 768000
17*800*480/8 = 816000
18*800*480/8 = 864000
19*800*480/8 = 912000
20*800*480/8 = 960000
21*800*480/8 = 1008000
22*800*480/8 = 1056000
23*800*480/8 = 1104000
24*800*480/8 = 1152000
Here's the updated version of my original list:
Phil, I will try downloading and using your perl program now.
It sure does speed it up, and reduces operations needed greatly for compiled code stack operations.
This represents actual PASM usage in the files, eliminating comments, strings, etc, and only looking in DAT sections. Thanks Phil!
The wet blanket is a good start, you might want to get the firehose ready though. :-)