The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

Bill Henning · 2014-04-07 21:20

I already proposed a hub slot mapping mechanism that will do even more than you ask for

Chip's P1+ design runs the hub at 200MHz.

I proposed adding a table, which specifies which cog gets access as follows (using Spin syntax to demo)

word set_slot[128]

Every clock cycle the cog specified by

set_slot[cnt & $7F]

gets the hub.

By default, set_slot is initialized with eight repetitions of 0..15 (8 x 16 = 128 entries) - so by default every cog gets a hub slot every eight instruction cycles.

This allows 100% deterministic slot assignments, for both heavy hub users (video) and light (serial port).
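A minimal sketch of the slot table idea, written in Python rather than Spin so it can be run on a PC. It assumes 16 cogs (from the thread title), so the default pattern is 0..15 repeated to fill 128 entries. The function names and the "video heavy" mapping are hypothetical illustrations, not part of the proposal.

```python
from collections import Counter
from itertools import cycle

NUM_COGS = 16      # assumption: cog count taken from the thread title
TABLE_SIZE = 128   # entries in set_slot, as in the proposal


def default_table():
    """Round-robin default: cogs 0..15 repeated until the table is full."""
    return [i % NUM_COGS for i in range(TABLE_SIZE)]


def slot_owner(table, cnt):
    """Cog that gets the hub on clock `cnt`, i.e. set_slot[cnt & $7F]."""
    return table[cnt & 0x7F]


def slot_share(table):
    """Fraction of all hub slots each cog receives over one table period."""
    counts = Counter(table)
    return {cog: counts[cog] / len(table) for cog in sorted(counts)}


def video_heavy_table():
    """Hypothetical mapping: cog 0 takes every other slot,
    the remaining slots rotate through cogs 1..15."""
    others = cycle(range(1, NUM_COGS))
    return [0 if i % 2 == 0 else next(others) for i in range(TABLE_SIZE)]


if __name__ == "__main__":
    print(slot_share(default_table())[0])      # share of cog 0 by default
    print(slot_share(video_heavy_table())[0])  # share of cog 0 when favoured
```

Because the table is fixed and indexed by the clock counter, every cog can work out in advance exactly which clocks belong to it, which is where the determinism comes from.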

dr hydra wrote: »

Would it be possible to set the number of cogs that can access the hub memory...therefore increasing the bandwidth to hub memory...

etc have a setting so all 16 cogs access memory..one to set it to 8 cogs...all the way down to one cog full access...increasing bandwidth with each setting....that way hub exec can be done at varying rates...the best of both worlds.

Bill Henning · 2014-04-07 21:23

Sorry, I intensely dislike cooperative tasking. Wastes a lot of cog space, have to insert manual yields.

Seairth wrote: »

That's why I suggested the minimal cooperative approach. This approach requires zero modification of the pipeline or instruction processing, while enabling what I imagine to be the biggest use case: a single cog with separate I/O read and write threads. With 16 cogs, I see much less need for the 4-task approach in P2. This is a KISS solution that should have the least impact on the new chip.

By the way, what nickname are we giving this thing?

JRetSapDoog · 2014-04-07 21:26

cgracey wrote: »

What would really blow it wide open would be to have a 256-bit hub data path, so that each cog could do an 8-instruction fetch every 8 instructions. That would have the effect of jacking the power up quite a bit, I'm afraid.

Consistent with my prior "pin features down early" post and avoid unnecessary limitations, I was just going to ask about this. If 128 bits is good, would 256 be better? Obviously, you love the way the HUB memory is laid out now and "pours" into the COGS four longs at a time. Maybe the power constraints could be overcome to allow for pouring 8. It sounds like this is in the back of your mind (or maybe the front), though I know you want to manage power conservatively from Day 1.

It's worth reconsidering, though, even if just to keep the word "WIDE" (I love the sound of that. Congrats to whoever coined it). Caution: EXTRA WIDE LOAD!

Heater. · 2014-04-07 21:38

Cooperative multi-tasking is a dog.

Example: full duplex UART rx and tx threads.

You can insert a "yield/suspend" point in their thread loops. But then you find the latency in responding to incoming edges is too high and the outgoing edges are very jittery.

OK add more "yield/suspend" points in those loops. Now you are pushing up the instruction count, wasting valuable COG space. Not only that you are slowing things down. Those yield instructions take time.

In the extreme you have a "yield/suspend" in between every actual useful instruction. Now you have:
a) doubled the size of your code and
b) halved its performance.

Hardware scheduled threads, with an instruction from each thread executed alternately, are much better because:
a) Code size is minimized.
b) Performance is doubled, due to a)
c) Latency in response to incoming edges and the tx clock is minimized, with a lot less jitter and more timing accuracy.

Anyway, if you want cooperative threading we already have it with jmpret don't we? As used in FullDuplexSerial on the P1.
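The trade-off above can be put into rough numbers. This is a simplified model in Python, not a cycle-accurate one: it assumes each yield costs exactly one instruction slot (as a single jmpret would on the P1) and that hardware interleaving adds no instructions at all. The function names are made up for illustration.

```python
def cooperative_cost(useful_instructions, useful_per_yield):
    """Cost of one yield after every `useful_per_yield` useful instructions.

    Assumption: a yield occupies one instruction slot and does no useful work.
    """
    yields = useful_instructions // useful_per_yield
    total = useful_instructions + yields
    return {
        "code_size": total,                             # instructions stored in the cog
        "useful_fraction": useful_instructions / total,  # share of time doing real work
        "max_wait": useful_per_yield + 1,               # instructions the other thread waits
    }


def interleaved_cost(useful_instructions):
    """Hardware alternates threads every instruction, with no yields in the code."""
    return {
        "code_size": useful_instructions,
        "useful_fraction": 1.0,
        "max_wait": 1,
    }


if __name__ == "__main__":
    print(cooperative_cost(8, 1))  # the extreme case: a yield after every instruction
    print(cooperative_cost(8, 4))  # fewer yields: smaller code, longer waits
    print(interleaved_cost(8))
```

With a yield after every useful instruction the model gives twice the code size and half the useful work, matching a) and b) in the extreme case. Spacing the yields out recovers code space but lengthens the time the other thread waits, which is the latency and jitter problem.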

Cluso99 · 2014-04-07 21:46

cgracey wrote: »

One aspect of having 4-long hub transfers is that with a few simple address tags (TLB, David?) we could direct out-of-cog addresses into 4-long register blocks which are serving as instruction caches. Part of the cog register RAM becomes the cache! No cache-line flipflops and mux's needed!

I think hub exec is going to happen, because it won't take much. What would really blow it wide open would be to have a 256-bit hub data path, so that each cog could do an 8-instruction fetch every 8 instructions. That would have the effect of jacking the power up quite a bit, I'm afraid. All cogs could run at 100% speed from the hub without branching or hub accesses.

I think Cluso thought up this possibility on the Prop2 effort.

Fantastic!

HUBEXEC in any form will be appreciated by us all. Even better if it can use a block of cog RAM. Happy to dedicate one to four 4-long register blocks to an instruction cache.

While the 256-bit wide hub would be nicer, we can live with 128-bit quad hub. While fetching from hub every 4 hubexec instructions, 50% speed is fine by me.
If you implemented a hub slot mechanism, we could crank this up to 100% by giving 2x slots.
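A back-of-envelope check of those percentages, as a small Python sketch. It assumes straight-line code (no branches or other hub accesses), 16 cogs with one hub slot per cog every 16 clocks, and two clocks per instruction, so eight instruction times pass between a cog's slots. The function name is hypothetical.

```python
def hubexec_speed(longs_per_fetch, instructions_per_slot):
    """Fraction of full cog speed when instructions stream from the hub.

    Each hub slot delivers `longs_per_fetch` instructions, and the cog could
    have executed `instructions_per_slot` instructions between its slots.
    """
    return min(1.0, longs_per_fetch / instructions_per_slot)


if __name__ == "__main__":
    print(hubexec_speed(4, 8))  # 128-bit hub, one slot per rotation
    print(hubexec_speed(8, 8))  # 256-bit hub, one slot per rotation
    print(hubexec_speed(4, 4))  # 128-bit hub, two slots per rotation
```

Under these assumptions the 128-bit path gives 50%, and either doubling the path width or doubling the slots for that cog brings it to 100%.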

Since you seem to think a pair (1x R, 1x W) 32bit registers between adjacent cogs is not likely to be feasible, what about a pseudo R/W Port C & D between cogs?
All cogs would OR their 64 bit register to the 64bit wide bus. All cogs can read the 64 bit bus. Access is via 2 32bit ports. No DIR bit required. None of the Port D complexities of P2.

Are you going to change the instruction decoding to more like the P2 ? ie are we likely to lose the NR bit ?

cgracey · 2014-04-07 22:00

Cluso99 wrote: »

Are you going to change the instruction decoding to more like the P2 ? ie are we likely to lose the NR bit ?

I think it will remain like Prop1. It's shaping up that we'll have very few instructions, so no pressure on the 32-bit opcode patterns.

The thing about 256-bit wide hub access is that it would double the RAM power on hub exec (and 2x the speed), and require double the mux's on the instruction, D, and S busses. It all comes down to power, mainly, and area less.

It wouldn't be that complex to switch to 256 bits at the end. We'll need to see what power looks like.

potatohead · 2014-04-07 22:07

Let's call it P1+

RossH · 2014-04-07 22:08

Cluso99 wrote: »

Hubexec is really important to GCC and Catalina C, and to other HL languages.

Hi Cluso,

Apart from the power saving, your Hubexec appears to add no functional or speed improvements over existing LMM code.

Is the power saving worth the additional complexity?

Ross.

Cluso99 · 2014-04-07 22:10

cgracey wrote: »

I think it will remain like Prop1. It's shaping up that we'll have very few instructions, so no pressure on the 32-bit opcode patterns.

The thing about 256-bit wide hub access is that it would double the RAM power on hub exec (and 2x the speed), and require double the mux's on the instruction, D, and S busses. It all comes down to power, mainly, and area less.

It wouldn't be that complex to switch to 256 bits at the end. We'll need to see what power looks like.

Thanks Chip.

Current P1 style instructions are great. In that case, you could use the NR bit for PORT A/B selection for WAITPxx.

cgracey · 2014-04-07 22:16

Cluso99 wrote: »

Thanks Chip.

Current P1 style instructions are great. In that case, you could use the NR bit for PORT A/B selection for WAITPxx.

Yes! That would be much better.

Cluso99 · 2014-04-07 22:34

RossH wrote: »

Hi Cluso,

Apart from the power saving, your Hubexec appears to add no functional or speed improvements over existing LMM code.

Is the power saving worth the additional complexity?

Ross.

It's now moot, but yes. It simplifies the programming because JMPX/CALLX/RETX can now address the whole hub (saves hub space, faster, and simpler), and can intermingle hub and cog instructions - ie call a cog routine for the extra speed.
The additional complexity for the Basic mode was extremely simple.

RossH · 2014-04-07 22:38

Cluso99 wrote: »

It's now moot, but yes.

Why moot? Did I miss something again? Gosh it's hard to keep up!

Ross.

cgracey · 2014-04-07 22:41

RossH wrote: »

Why moot? Did I miss something again? Gosh it's hard to keep up!

Ross.

There needs to be some forum software that lets you see activity in some 3D fashion, with the ability to order all posts in time, while seeing their separate threads. This current medium is way better than nothing, but inadequate for the recent rate of chatter. Too much bifurcation going on.

Cluso99 · 2014-04-07 22:43

RossH wrote: »

Why moot? Did I miss something again? Gosh it's hard to keep up!

Ross.

Yep

cgracey wrote: »

One aspect of having 4-long hub transfers is that with a few simple address tags (TLB, David?) we could direct out-of-cog addresses into 4-long register blocks which are serving as instruction caches. Part of the cog register RAM becomes the cache! No cache-line flipflops and mux's needed!

I think hub exec is going to happen, because it won't take much. What would really blow it wide open would be to have a 256-bit hub data path, so that each cog could do an 8-instruction fetch every 8 instructions. That would have the effect of jacking the power up quite a bit, I'm afraid. All cogs could run at 100% speed from the hub without branching or hub accesses.

I think Cluso thought up this possibility on the Prop2 effort.

Cluso99 · 2014-04-07 22:45

cgracey wrote: »

There needs to be some forum software that lets you see activity in some 3D fashion, with the ability to order all posts in time, while seeing their separate threads. This current medium is way better than nothing, but inadequate for the recent rate of chatter. Too much bifurcation going on.

Too true!
I have done virtually nothing but catch up and then keep up here in the past 5 hours

RossH · 2014-04-07 22:49

Cluso99 wrote: »

Yep

Cluso and Chip,

Great! The idea of multi-long hub transfers was always my favored form of Hub Exec, and was what I was going to use in my original port of Catalina to the P2 when I first gave it some thought a year or two ago.

Very easy to understand, and very easy to implement in software!

Cluso99 · 2014-04-07 23:14

Chip,
What are the chances of getting an FPGA image (without the extras) so that we can start playing ?

As long as we can boot and load, we don't need the monitor yet. In fact we can do that for you if you like - only need to convert the older P2 one back into P1 instructions.

We can compile with PropTool for now (or an old pnut ?) as long as we can load our new P16B.

No need for docs either.

jmg · 2014-04-07 23:44

jmg wrote: »

Rayman wrote:

So I guess we could do 8-bit fullscreen WVGA using a 384kB pixel buffer...
Guess we'd have one cog half full with a 256 long CLUT and the rest of it would just push the pixel buffer out the DAC...

Maybe this is a case for the simple tasking ? Only instead of 2 pgms slicing, one is the code and the other is the Video-Gen using the COG memory instead of its own local memory.
Preserves RAM which is the die costly item here.

I'll relabel this a little with a more descriptive term, that better reflects what actually happens here

Video (CLUT) Memory Sharing

borrowing a term from the PC world of cheap systems where they have one memory array and Code & Video share the bus.
Saves the die area of a separate CLUT, but shares COG RAM and slots (50%) to do this.

The video HW shifts each 8b (/4b/2b/1b?) pixel and uses that as the CLUT index, and sends that 32 bit read, split to the DAC or Direct to pins.
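A rough Python model of that lookup path, to make the data flow concrete. It is only a sketch: the least-significant-pixel-first shift order and the $00RRGGBB entry layout are assumptions, and the function names are invented for illustration.

```python
def unpack_pixels(long_value, bits_per_pixel=8):
    """Split a 32-bit long into pixel indices, least significant pixel first
    (assumed shift order)."""
    mask = (1 << bits_per_pixel) - 1
    count = 32 // bits_per_pixel
    return [(long_value >> (i * bits_per_pixel)) & mask for i in range(count)]


def clut_lookup(clut, long_value, bits_per_pixel=8):
    """Use each pixel as an index into the CLUT held in cog RAM and
    return the 32-bit entries that would be sent on to the DACs or pins."""
    return [clut[p] for p in unpack_pixels(long_value, bits_per_pixel)]


def split_rgb(entry):
    """Split a 32-bit CLUT entry into 8:8:8 channels,
    assuming a $00RRGGBB layout (hypothetical)."""
    return (entry >> 16) & 0xFF, (entry >> 8) & 0xFF, entry & 0xFF


if __name__ == "__main__":
    grey_ramp = [i * 0x010101 for i in range(256)]  # 256-long CLUT, half a cog
    for entry in clut_lookup(grey_ramp, 0x04030201):
        print(split_rgb(entry))
```

Each pixel costs one cog RAM read, which is where the shared slots come from.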

cgracey · 2014-04-07 23:45

Cluso99 wrote: »

Chip,
What are the chances of getting an FPGA image (without the extras) so that we can start playing ?

As long as we can boot and load, we don't need the monitor yet. In fact we can do that for you if you like - only need to convert the older P2 one back into P1 instructions.

We can compile with PropTool for now (or an old pnut ?) as long as we can load our new P16B.

No need for docs either.

It will probably take me a few weeks to get to the point where we have an FPGA image. Right now, I'm figuring out what instructions we need, in order to get some handle on the opcodes.

RossH · 2014-04-08 00:18

jmg wrote: »

I'll relabel this a little with a more descriptive term, that better reflects what actually happens here

Video (CLUT) Memory Sharing

borrowing a term from the PC world of cheap systems where they have one memory array and Code & Video share the bus.
Saves the die area of a separate CLUT, but shares COG RAM and slots (50%) to do this.

The video HW shifts each 8b (/4b/2b/1b?) pixel and uses that as the CLUT index, and sends that 32 bit read, split to the DAC or Direct to pins.

If my information is still current (and who knows, since it is already minutes old!) then I believe Chip has said no CLUT and no tasks.

Ross.

cgracey · 2014-04-08 00:24

RossH wrote: »

If my information is still current (and who knows, since it is already minutes old!) then I believe Chip has said no CLUT and no tasks.

Ross.

I know that a CLUT is really important for 8-bit pixel data, so that it can be expanded into 8:8:8 RGB. Even a tiny CLUT for 4-bit pixel data would be nice. We'll see.

Tasks are easy to implement. Having a simplified time-slot scheme would be best for this chip. Tubular had a great idea for making variable-length slot patterns that was implemented in Prop2, but that is too complex for this chip.

Brian Fairchild · 2014-04-08 00:36

re: More than 64 I/O pins

Sorry guys but you're using the wrong chip. There will always be a desire to make a universal chip but at some point you have to decide between microcontroller, microprocessor and SOC. In my book, a microcontroller is a single self-contained system that contains enough memory for its target market. 64 I/O is more than enough for a microcontroller.

Adding more and more I/O pins simply makes the P1+ a P2 and should be resisted.

re: SDRAM

same as more I/O. You need SDRAM? Buy a microprocessor not a microcontroller.

re: in COG multitasking

Nooooooooooooooooooo. Keep it simple. A COG is a task. You want another asynchronous task? Give it its own COG.

Brian Fairchild · 2014-04-08 00:43

Bill Henning wrote: »

Could you find the link for me? I am not sure which post you are referring to. Glad to re-calculate for you after that.

Hi Bill,

in the Possible simple HUBEXEC method for P16+X32B - discussion I wrote...

So, just to get this straight in my head...

At 200MHz main clock, a P1E COG would run at 100MIPS from COG RAM and eight of them could run at 25MIPS each from HUBRAM using your slot allocator mechanism.

to which you replied...

Yes.

...which led me to say...

Good, glad I've got it.

So if a 16 COG P1+ appears I can have 8 COGs acting as very capable intelligent peripherals accessing HUB RAM at 1/128 to exchange data and 8 COGs running out of HUB at 15/128 or around 23.5 MIPS each! So that's a chip with a pile of intelligent peripherals and the equivalent of 8 typical 8-bit processors (except they're 32-bit) all in one package.

My question is now...."if your new proposal was implemented, given the above scenario, what speed increase would we see in the COGs running from HUB which currently run at 23.5MIPS?"
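For reference, the slot arithmetic in that scenario can be checked with a few lines of Python. The sketch assumes a 128-entry slot table, a 200MHz clock, and one hub instruction executed per granted slot, which is what the 25MIPS figure implies. The function name is hypothetical.

```python
def hub_mips(slots, table_size=128, clock_mhz=200.0):
    """MIPS for a cog running from hub, assuming one instruction per granted slot."""
    return clock_mhz * slots / table_size


if __name__ == "__main__":
    # Eight cogs sharing the hub equally: 16 of 128 slots each.
    print(hub_mips(16))
    # Mixed scenario: 8 peripheral cogs with 1 slot, 8 compute cogs with 15 slots.
    assert 8 * 1 + 8 * 15 == 128   # the table is exactly full
    print(hub_mips(15))
    print(hub_mips(1))
```

This gives 25 MIPS for the equal split and 23.4375 MIPS (the "around 23.5") for the 15/128 cogs, with the 1/128 cogs still getting about 1.56 million hub accesses per second.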

RossH · 2014-04-08 00:50

Brian Fairchild wrote: »

re: More than 64 I/O pins

Sorry guys but you're using the wrong chip. There will always be a desire to make a universal chip but at some point you have to decide between microcontroller, microprocessor and SOC. In my book, a microcontroller is a single self-contained system that contains enough memory for its target market. 64 I/O is more than enough for a microcontroller.

Adding more and more I/O pins simply makes the P1+ a P2 and should be resisted.

re: SDRAM

same as more I/O. You need SDRAM? Buy a microprocessor not a microcontroller.

re: in COG multitasking

Nooooooooooooooooooo. Keep it simple. A COG is a task. You want another asynchronous task? Give it its own COG.

Much as I really like the idea of external SDRAM, I agree - best to save that kind of thing for the P3.

But I think it is hopeless - in the post just above this one, Chip is already reconsidering the CLUT!

Ross.

Heater. · 2014-04-08 00:54

I tend to agree.

This obsession with external RAM seems very misguided for this device. If you need a huge program/data space get a chip that is architected to do that and will do it much better and faster.

The desire for external RAM leads to the need for dozens more pins. And before you know it everything has spiralled far away from the original vision.

Problem is, people will make comparisons to the STM32F4, the smallest of which has 96K RAM and up to 512K FLASH. And that's a micro-controller, not a microprocessor.

Multitasking: I don't know. It was very valuable when only 8 COGs were on offer. Not so much with 16 or 32. But still nice if it's not horribly big, complex and power hungry to implement and is easy to use.

cgracey · 2014-04-08 01:01

Heater. wrote: »

I tend to agree.

This obsession with external RAM seems very misguided for this device. If you need a huge program/data space get a chip that is architected to do that and will do it much better and faster.

The desire for external RAM leads to the need for dozens more pins. And before you know it everything has spiralled far away from the original vision.

Problem is, people will make comparisons to the STM32F4, the smallest of which has 96K RAM and up to 512K FLASH. And that's a micro-controller, not a microprocessor.

Multitasking: I don't know. It was very valuable when only 8 COGs were on offer. Not so much with 16 or 32. But still nice if it's not horribly big, complex and power hungry to implement and is easy to use.

SDRAM can be used with this chip without any special peripherals in the cogs. You would set a pin for NCO mode so that it's generating a 100MHz square wave (updating at 200MHz). You would then have one instruction period in the cog to match its clock period. You might need two cogs, but one would issue SDRAM commands while the other fed/gathered the data.
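The NCO setting for that clock can be sketched in Python. This assumes a P1-style counter, where a 32-bit phase register adds FRQ on every system clock and the pin follows the top bit. Whether the new chip keeps exactly that scheme is an assumption, and the function names are illustrative only.

```python
def nco_frq(f_out_hz, f_clk_hz, bits=32):
    """FRQ value that makes the top phase bit toggle at f_out_hz."""
    return round(f_out_hz * (1 << bits) / f_clk_hz)


def nco_output(frq, clocks, bits=32):
    """Simulate the pin: accumulate FRQ each clock, output the top phase bit."""
    mask = (1 << bits) - 1
    phase = 0
    pin = []
    for _ in range(clocks):
        phase = (phase + frq) & mask
        pin.append(phase >> (bits - 1))
    return pin


if __name__ == "__main__":
    frq = nco_frq(100_000_000, 200_000_000)
    print(hex(frq))              # FRQ for 100MHz out of a 200MHz clock
    print(nco_output(frq, 8))    # pin level on each 200MHz clock
```

With FRQ at half of 2^32 the pin changes on every 200MHz clock, giving a 100MHz square wave whose period equals one two-clock instruction time.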

Multitasking can still be a good thing if kept to bare bones (no multiples of PTRs and peripherals like REP). It can enable a single cog program to have a few concurrent threads, which is necessary for something like the ROM monitor. Well, you invented it, so you should know.

koehler · 2014-04-08 01:04

ctwardell wrote: »

I like the idea of having some form of HUBEXEC, if it gets designed in public that's fine, if it happens behind closed doors that's fine as well.

C.W.

Still 8 pages out from the end of thread, however I agree.
However not just from a technical standpoint.

Adding this allows the Prop to FINALLY be able to say that you can have 1 Core with the equivalent of 2-512K of memory.
That is something that will make people do a double-take, and hence increase potential adoption.
This far exceeds the majority of uC and even many, many ARM SKUs.

cgracey · 2014-04-08 01:19

koehler wrote: »

Still 8 pages out from the end of thread, however I agree.
However not just from a technical standpoint.

Adding this allows the Prop to FINALLY be able to say that you can have 1 Core with the equivalent of 2-512K of memory.
That is something that will make people do a double-take, and hence increase potential adoption.
This far exceeds the majority of uC and even many, many ARM SKUs.

Thanks for pointing that out.

Brian Fairchild · 2014-04-08 01:24

Heater. wrote: »

...people will make comparisons to the STM32F4, the smallest of which has 96K RAM and up to 512K FLASH. And that's a micro-controller, not a microprocessor.

True, although with a headline number of 512k the P1+ shouldn't fall too short. Plus, with the P1+ you can trade off program against data.

At the risk of getting all nostalgic, my first microcontroller work was with 8751s. 4k of UV EPROM and all of 128 bytes of RAM. Program, burn, debug, erase, repeat.

Heater. · 2014-04-08 02:03

Chip,

I have no worries about people wanting to attach SDRAM using the pins, counters and other features available. I'm sure I will be wanting to use that myself.

I do worry about SDRAM driving the need for HDL features and bloat that slowly get us back to the "just another wafer thin" feature that turns the design into an exploding Mr Creosote (again).

I do worry about SDRAM driving the need for lots more pins which change the package to something unusable for mere mortals and leads to the chip only being available on a carrier board that includes an SDRAM which is mostly redundant.

Yep, I'm all for the multi-tasking, bare bones as you say.

If only I could claim to have invented it. I simply borrowed the idea from David May and his XMOS devices and delivered it here.

@Brian,

Yep, The 512K headline is quite fine, especially if hubexec works.

We love nostalgia around here. The 8751 was great. Amazing to think they were a couple of hundred pounds each in the ceramic packages we were using at the time.
