The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

Heater. · 2014-05-01 14:29

jmg,

That explains why I cannot follow or fathom the 'Logic' behind the reaction of not giving users control.

It is not derived from technical, it is based on someone else's perceived value, like art, or music.
Even there however, I choose my own music and art.

Yes. You are starting to understand.

It's why BASIC was more widely used than ALGOL or FORTRAN.

Why C won out over the obviously superior Pascal and Ada.

Why we all use the x86 architecture instead of the 68000.

A thing that is available now, simple, cheap and "works good enough" trumps every theoretically better way.

To be less flippant, yes it is a "value" judgement. In a very concrete use of the word "value".

Is it more profitable at the end of the day to make the PII more easily usable by the majority with the least surprises? Or is it more profitable to enable a few as yet unidentified applications to run faster?

As I have said before, that is for Chip and Ken to decide.

Seairth · 2014-05-01 14:31

Roy asked earlier for a non-theoretical example of where extra hub cycles would actually be useful. I particularly see a need for this with an external RAM driver. The issue isn't with the transfer of the data, which will likely be "fast enough" with the 128-bit bus. The issue is with the latency between the ram cog and the requesting cog(s).

Right now, this would be a typical read routine:

1) Requester performs WRQUAD to initiate a ram read request (ram address to read from, hub address to store at, number of bytes/longs/etc).
2) 1 to 15 clock cycles later, the ram cog receives the request by polling with RDQUAD.
3) The ram cog performs the operation on the external ram and stores the results to the hub. As I said, even with the current modulo 16 approach, this is likely "fast enough".
4) The ram cog performs a final WRxxx to indicate that the read request has been completed.
5) 15 to 1 clock cycles later, the requesting cog detects the response by polling with RDxxx.
6) The requesting cog performs one or more RDxxx to fetch the data. Again, "fast enough".

In the above scenario, steps 2 and 5 together always cost 16 clock cycles of waiting for the hub access window. Now, if more than one cog needs to communicate with the ram cog, a hub-based queuing mechanism has to be added, which itself will require several back-to-back hub accesses to manipulate. The latency incurred by waiting for the hub access window can end up becoming a significant percentage of the overall operation.
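A rough sketch of why the two polling waits add up to one full hub rotation. It assumes a 16-slot round-robin hub with one slot per clock, and that every hub write or read lands exactly in the issuing cog's own slot; the function name is made up for illustration.

```python
COGS = 16  # assumption: one hub window per cog, visited round-robin, one per clock

def handshake_wait(requester, driver):
    """Clocks lost to hub-window alignment in the request and response polls."""
    # The request is written in the requester's window and is first
    # visible to the driver in the driver's own window.
    request_gap = (driver - requester) % COGS
    # The "done" flag is written in the driver's window and is first
    # visible to the requester in the requester's own window.
    response_gap = (requester - driver) % COGS
    return request_gap, response_gap

for driver in (1, 8, 15):
    req_gap, resp_gap = handshake_wait(0, driver)
    print(f"cog 0 <-> cog {driver}: {req_gap} + {resp_gap} = {req_gap + resp_gap} clocks")
```

Whichever two cogs are involved, the two gaps are complements modulo 16, so the pair always costs 16 clocks on top of the actual RAM operation.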

More generally, I think this issue would be true for just about any shared, high-bandwidth, external resource that uses one cog as a driver and multiple other cogs to interact with it via HUBOPs with the "driver cog". The reason that I qualified it with "high-bandwidth" is because the relative percentage of time devoted to hub accesses for low-bandwidth resources (CANBUS, low-baud serial, I2C, etc.) is likely going to be lower than with the high-bandwidth resources (high-speed SPI, QSPI, RAM, Prop-to-Prop I/O, etc.). Nevertheless, there is still a certain amount of overhead that results from having multiple cogs coordinate access to shared resources. This overhead gets significantly worse for every additional cog that is involved. And, while that would still hold true for *any* access scheme, I think an improved access scheme would significantly reduce the latency aspect and therefore allow much better overall performance than with the current approach.

As a side note, when I looked through a variety of "driver" objects in the OBEX, every single one I looked at used a single-entry command queue, which meant that none of those could be easily used by multiple cogs at the same time. What's unclear to me is whether this is a natural result of the limitations of the P1 (whatever that may be) or whether it's actually really rare to have more than one cog communicating with a "driver" at any given time.

Seairth · 2014-05-01 14:32

Several people here have commented that alternative hub access schemes are a bad idea because it would cause a loss of determinism. It is interesting to me that many of these same people argue in favor of hardware multitasking, which also causes a loss of determinism. In defense of the multitasking, the most common response to the non-determinism concern has been "don't use it if you need determinism." And I agree. That's a reasonable argument. And it happens to be the same argument that several (all?) of the proponents of alternative hub accesses give: don't use it if you need determinism. It seems unreasonable to argue against one form of non-deterministic processing and simultaneously argue in favor of the other. Likewise, it seems unreasonable to dismiss the "don't use it" argument for one form and simultaneously agree with it for the other.

Oh, and incidentally, should hardware multitasking make it back into the P1+, I think that would be a very good reason to add an advanced hub access mechanism. This would significantly reduce task contention for hub access, as well as significantly reduce stalls (which affect every task).

Seairth · 2014-05-01 14:37

Chip,

Out of curiosity, how many clock cycles will the hub instructions take to complete? For instance, on the P1, it was something like 7 cycles and on the P2 it was 3 cycles (for reads). How many cycles will they take on the P1+?

Heater. · 2014-05-01 14:37

Todd,

"mooch"

There is a proposal that a program running in a COG should be able to use a "HUB slot" to read and write RAM at whatever time it likes, provided that particular time in the HUB cycle is not actually needed by some other COG. If that HUB slot is used by the other COG in the normal way then our first guy does not get it.

Kind of like "mooching" for a free lunch, but it's no disaster if you fail.

Heater. · 2014-05-01 14:47

Seairth,

Several people here have commented that alternative hub access schemes are a bad idea because it would cause a loss of determinism. It is interesting to me that many of these same people argue in favor of hardware multitasking...

One of those would be me.

...which also causes a loss of determinism.

No. It does not.

The unit of functionality here is the program running in a COG. The idea is that no matter what happens in other COGS "this COG" runs the same. And no matter what "this COG" does the others always run the same.

That is determinism.

If "this COG" has one thread or many is of no consequence. No more so than the current FullDuplexSerial driver that has two coroutines. That has no effect on the other guys out there.

jmg · 2014-05-01 14:55

Roy Eltham wrote: »

... So we are talking about hubexec being able to go ~50 MIPs versus ~100 MIPs barring any other hub accesses, branches, or waits.

You are correct it is a simple bandwidth issue. 16 COGS can manage 50M 32 bit transfers, provided they are packed to 128b.

Now imagine an 8 cog Map setting, that can manage 100M 32 bit transfers. Still 100% deterministic.

Users will not always use all 16 cogs, and a great many designs will be fine with 8 active.

Mapping allows those 8 COG users, to run at half the clock speed for the same Bandwidth, which means a lot to the power envelope. On this chip, the Power Envelope will still matter.
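The figures above can be checked with a few lines of arithmetic. This is only a sketch: the 200 MHz system clock, one hub slot per clock, and four longs (128 bits) moved per slot are assumptions chosen because they reproduce the 50M and 100M numbers, and the function name is hypothetical.

```python
def long_transfers_per_second(sysclk_hz, slots_in_rotation, longs_per_slot=4):
    """32-bit transfers per second for one cog owning one slot per hub rotation."""
    return sysclk_hz // slots_in_rotation * longs_per_slot

print(long_transfers_per_second(200_000_000, 16))  # 16-cog rotation
print(long_transfers_per_second(200_000_000, 8))   # 8-cog map
print(long_transfers_per_second(100_000_000, 8))   # 8-cog map at half the clock
```

Under these assumptions the 8-cog map at half the clock matches the 16-cog rotation at full clock, which is the power-envelope point being made.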

ATE and Instrumentation systems may want to burst-capture many pins, into HUB as fast as silicon allows.
A slightly smarter design may want to burst capture pin-change & time-stamp info into HUB, and a parameter of that design, is the closest-adjacent edge resolution.

Phil Pilgrim (PhiPi) · 2014-05-01 15:00

Re: mooching. What happens when two cogs mooch at the same time? Who gets priority? Oh, I can see it now: someone is going to propose a hub-access queue!

-Phil

Dave Hein · 2014-05-01 15:09

Phil, several weeks ago I suggested a hub arbitration scheme that handles simultaneous requests, and doesn't require a queue. It weights the request by the distance from the cog's hub slot.

potatohead · 2014-05-01 15:11

Exactly Phil.

The simple scheme works well. These others are more than just adding a little feature. Our tasking discussion ended up really long! Some great stuff fell out of it all too.

FWIW, these have all been discussed, mooch, maps of all kinds, etc.... never seen as even FPGA ready code or logic.

Of these, the mooch one seemed to have the least impact, but all of them require more to happen in the HUB than seen at first glance.

Heater. · 2014-05-01 15:15

Grrr..HUB access queue, HUB access arbitration. It's all the same and has the same effect.

What we need is NUMA. Full speed ahead in your own HUB RAM space. A bit of a fight and arbitration when you want to share.

potatohead · 2014-05-01 15:16

BTW, a burst capture can be done with 4 COGS now. One could just sync 4 of them up, do the capture, move on, etc...

I'm laughing at the idea of NUMA being in a microcontroller! SGI bought the timing and interconnect protocol from CRAY and scaled it up. Very complex beastie.

But, you do get a single OS image with impressive stats! 1024 CPU online, 50TB RAM, 49TB free!

Rayman · 2014-05-01 16:26

How does one get the result from a 24x24 bit multiply?

With 16x16, I guess I'd expect the 32-bit result in the destination register.
Is that still true if the operands are only 16 bit?

jmg · 2014-05-01 16:36

Rayman wrote: »

How does one get the result from a 24x24 bit multiply?

With 16x16, I guess I'd expect the 32-bit result in the destination register.
Is that still true if the operands are only 16 bit?

I wondered the same thing... I think 16x16 can right justify to the lower 32b in the 48b result, at least for unsigned?

Phil Pilgrim (PhiPi) · 2014-05-01 16:46

Rayman wrote:

How does one get the result from a 24x24 bit multiply?

It depends on where you assume the radix point is. 8.16 x 8.16 = 16.16, which would be handy for signal-processing apps.

-Phil
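A small worked example of the radix-point view, as a sketch only. It treats both operands as unsigned 8.16 fixed-point values held in 24 bits; the raw 48-bit product then has 32 fractional bits, and dropping the low 16 leaves a 16.16 result. The helper names are invented for illustration and say nothing about how the actual multiplier returns its result.

```python
FRAC_BITS = 16  # 8.16 format: 8 integer bits, 16 fractional bits

def to_fix(x):
    """Convert a non-negative float to unsigned 8.16 fixed point."""
    return int(round(x * (1 << FRAC_BITS)))

def fix_mul(a, b):
    """8.16 x 8.16 -> 16.16: drop the 16 extra fractional bits of the raw product."""
    return (a * b) >> FRAC_BITS

def to_float(x):
    """Interpret a value with 16 fractional bits as a float."""
    return x / (1 << FRAC_BITS)

a = to_fix(1.5)    # 0x018000
b = to_fix(2.25)   # 0x024000
p = fix_mul(a, b)
print(hex(a), hex(b), hex(p), to_float(p))
```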

dMajo · 2014-05-01 16:52

Regarding HUB slot sharing, if memory serves me, it was already agreed that if it is done at all:

1) Any cog can enable a mode (hungry, it was called) that can use only the other cogs' unused slots; there is no guarantee of higher bandwidth (if all the other cogs use their slots).
2) Hungry mode will have an instruction for the cog to enable it, making it easy to trace in the code.
3) There will be a global instruction (that every cog can call) to globally disable Hungry mode.

Even Heater and PhiPi, IIRC, agreed on this.
While I am also in favor of a Hungry mode, because I feel that an application (HMI) cog (written in Spin or C) can benefit from it and it doesn't break the determinism, I am opposed to any other solution.

I can't understand why people keep bringing (forcing) the matter up with new proposals, mapping schemes, allocation tables ...
I would much prefer that, instead of wasting time on this, universal usart/i2c/usb/quadspi and crc helper hardware blobs (serdes) were discussed, since this is currently totally missing. It would be nice if it could also support handshake signals and direction/driver signals for rs232/rs485 modes and, if not too hard, Manchester encoding.

With 100MIPS cogs I think that Profibus (rs485 based, up to 12Mb) is also doable. In the industrial field (and also in utilities, e.g. electricity, water) most of the production lines (control systems) in factories are driven by Siemens PLCs, and Siemens = Profibus. Since there are already no issues with Modbus RTU, adding Profibus could open the doors to many Propeller based instruments, test equipment and pre-processed sensors (production displays, rfid readers/writers, multi-axis drivers, energy meters, data-loggers) ... that could in this way easily integrate into the lines.

Rayman · 2014-05-01 17:31

Maybe. But wouldn't it be a pain if I just wanted to do 16x16? I'd have to do two shifts before multiply...

Phil Pilgrim (PhiPi) wrote: »

It depends on where you assume the radix point is. 8.16 x 8.16 = 16.16, which would be handy for signal-processing apps.

-Phil

jmg · 2014-05-01 17:40

dMajo wrote: »

..... universal usart/i2c/usb/quadspi and crc helper hardware blobs (serdes) were discussed, since this is currently totally missing. It would be nice if it could also support handshake signals and direction/driver signals for rs232/rs485 modes and, if not too hard, Manchester encoding.

With 100MIPS cogs I think that Profibus (rs485 based, up to 12Mb) is also doable. In the industrial field (and also in utilities, e.g. electricity, water) most of the production lines (control systems) in factories are driven by Siemens PLCs, and Siemens = Profibus. Since there are already no issues with Modbus RTU, adding Profibus could open the doors to many Propeller based instruments, test equipment and pre-processed sensors (production displays, rfid readers/writers, multi-axis drivers, energy meters, data-loggers) ... that could in this way easily integrate into the lines.

I would agree, and I think Chip was going to look at putting Serial support HW into the Smart pins.

Once the COG & ALU are done, and opcodes are largely nailed down, a stable base for software will be there.
That point seems to be very close. 24b Multiply was the most recent fine-tune, based on sim-speed and area.

There are requests for a couple of mode-extensions to the COG counters, to better support reload/monoflop timing for tasking.
Those have low silicon cost, and should make the cut if the speed margin allows a couple more modes.

That will then give a stable COG, and allow the details of smart pins to be looked into.

Ariba · 2014-05-01 18:26

Rayman wrote: »

Maybe. But wouldn't it be a pain if I just wanted to do 16x16? I'd have to do two shifts before multiply...

You will just use another instruction:
1) MUL, MULS for a multiply that return the lower 32 bits. You can do 16x16 or 12x20 or 8x24 or other variations
2) SCL for a fractional multiply that in my opinion should shift right the 24x24 result by 22, so you get a scaled result with a coefficient in 2.22 format.
The best would be if the SCL instruction wrote its result into a special result register, not to the D register; this allows much more efficient DSP algorithms (the result register must be cog-register-mapped). If Chip is interested, I can show some examples that show the difference.

Andy
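To make the two cases concrete, here is a sketch in Python of what the suggested instructions would compute. It is unsigned only, the function names are invented, and the shift-by-22 scaling is Andy's proposal rather than a confirmed instruction definition.

```python
MASK24 = 0xFFFFFF
MASK32 = 0xFFFFFFFF

def mul_low32(d, s):
    """MUL-style: keep only the lower 32 bits of the product."""
    return (d * s) & MASK32

def scl(d, coeff_2_22):
    """SCL-style: 24x24 product shifted right by 22 (coefficient in 2.22 format)."""
    return ((d & MASK24) * (coeff_2_22 & MASK24)) >> 22

three_quarters = 3 << 20              # 0.75 in 2.22 format
print(hex(mul_low32(0xFFFF, 0xFFFF))) # 16x16 fits entirely in 32 bits
print(scl(1000, three_quarters))      # 1000 scaled by 0.75
print(scl(1000, 1 << 22))             # coefficient 1.0 leaves the value unchanged
```

No pre-shifting is needed for the 16x16 case, because any operand widths that sum to 32 bits or fewer produce a product that fits in the low 32 bits.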

jmg · 2014-05-01 19:28

dMajo wrote: »

With 100MIPS cogs I think that also Profibus (rs485 based up to 12Mb) ...

Expanding this a little - Profibus transceivers are spec'd up to 40Mbps and RS485 transceivers run to 50 & 100Mbps.

A 200MHz SysClk with a /4? Rx sample & HW support, could likely get to 50MBd receive region, maybe more for sync protocols.

Cluso99 · 2014-05-01 20:05

heater: Finally you have picked up on the time/space/implementation argument AT LAST.

Nothing before was being argued this way.

Aside from this, if the obex is precluded from objects using variable hub slots, however they were implemented, you and a few others are against it. This is the real FUD.

You want to prevent me, Bill, and others from using the otherwise unused bandwidth, just because someone might not read how an object(s) is required to be used. Don't worry if it requires n cogs or all hub - that's not an issue?
I am quite capable of managing the issues, and anyone I choose to let use my demanding objects I can inform directly.

It has been asked for proof of benefits - there have been some examples already given. How about proving how failures can occur with standard obex objects if no objects using slot sharing are permitted in the obex!

Now, I have already given an example of USB FS where hub slot sharing may be required. Currently using 80MHz FPGA P2 cog code, I cannot perform usb fs. Multiple cogs do not actually help if you cannot read the bits inline! And this was using single clocked instructions of the old P2. I am hoping to get a helper instruction from Chip.
Just so I don't mislead, the P1 overclocked did do usb fs with a lot of non-compliance and iirc 5 cogs.
I am hoping to avoid a lot of this non-compliance, but I don't expect to be fully compliant.

Now, what is the harm if I can write an object that can get 2x slot accesses by taking 2 cogs (paired cogs as in 1 & 9, etc) and by using a slot sharing scheme

Cluso99 · 2014-05-01 20:13

Sorry my **** xoom croaked the post!

If I can launch 2 cogpairs, I could devote the second cog's slot to the first cog. This was my original cog sharing request. If I own both cogs, and hence their slots, this cannot possibly cause any other object to fail. If you don't agree, please explain how!!!

We don't even know how easy/hard it would be for Chip because the FUD posters have only just realised this is an argument to help their cause.

Chip, please consider the time/space/power of including this, even if you don't release the details publicly.
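A minimal model of the paired-cog idea, assuming a plain 16-slot round-robin in which slot n belongs to cog n and a donor cog can hand its slot to its partner (n and n+8). The function names and the `donations` mapping are hypothetical; nothing here reflects actual hub logic.

```python
NUM_COGS = 16

def slot_schedule(donations=None):
    """Owner of each hub slot; donations maps donor cog -> recipient cog."""
    schedule = list(range(NUM_COGS))
    for donor, recipient in (donations or {}).items():
        schedule[donor] = recipient
    return schedule

def worst_case_wait(schedule, cog):
    """Largest gap, in clocks, between consecutive slots granted to cog."""
    slots = [i for i, owner in enumerate(schedule) if owner == cog]
    if not slots:
        return None  # this cog has given its slot away
    gaps = [(slots[(k + 1) % len(slots)] - s) % NUM_COGS or NUM_COGS
            for k, s in enumerate(slots)]
    return max(gaps)

shared = slot_schedule({9: 1})     # cog 9 donates its slot to cog 1
print(shared)
print(worst_case_wait(shared, 1))  # cog 1 now gets a slot every 8 clocks
print(worst_case_wait(shared, 2))  # an unrelated cog is unaffected
```

In this model only the two slots owned by the pair change hands, so every other cog sees exactly the same schedule as before.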

RossH · 2014-05-01 20:21

Cluso99 wrote: »

Now, what is the harm if I can write an object that can get 2x slot accesses by taking 2 cogs (paired cogs as in 1 & 9, etc) and by using a slot sharing scheme

Yes, I think this scheme would work ok (unlike any of the others I have seen posted!).

Ross.

koehler · 2014-05-01 20:44

Cluso99 wrote: »

This was my original cog sharing request. If I own both cogs, and hence their slots, this cannot possibly cause any other object to fail. If you don't agree, please explain how!!!

Sorry, I thought Tubular had first mentioned that COG arrangement.
Nevertheless, I second that request!

Instead of trying to force anyone to prove how it won't fail, which has been amply discussed, please do us the courtesy of showing how it will fail.
Surely that is much easier.

And, no, you don't get to assume that everyone up to and including Prop familiar developers is not going to understand which COGs are currently bandwidth mooched.

Not directed specifically at Heater, but at anyone else in the outspoken 'Nay' camp.

Personally, I try to give people the benefit of the doubt. I'm pretty sure the very first time this bites someone, they'll figure it out quickly and become just that much smarter in the future. If they bork a big project, well, stupid is as stupid does. How much hand holding does a 'Pro' need?

Electrodude · 2014-05-01 20:49

How about instead of wasting a whole cog just so another can be in hungry mode, you add an option to be able to use the other cog in the pair but you'll have to acknowledge in the cognew that it's OK if you only get hub slots when the cog's pair doesn't want the slot. You could make it so when you start a cog, you can pick normal, hungry, or diet mode. The two cogs in each pair of cogs could either both be normal, or one could be hungry and one could be diet. If you start one cog as a hungry cog, its pair will only ever be started if the cognew specifies that a diet cog is OK, otherwise an unpaired cog will be used (or one paired with a diet cog). If there are more diet cogs than hungry cogs, diet cogs would be paired with each other (or with normal cogs) and be promoted to normal mode. It might be easier if a restriction was placed so cogs 0-7 could only be normal or hungry and 8-15 could only be normal or diet, instead of allowing any cog to be normal, hungry, or diet, which would lead to twice as much hub slot arbitration circuitry.

Illustration, supposing any of the 16 cogs can be in any of the 3 modes:

cognew(hungry) -> new cog 0 is hungry, unused cog 8 can only be diet
cognew(normal) -> new cog 1 is normal, unused cog 9 can be diet or normal
cognew(diet) -> new cog 8 is diet, paired with hungry cog 0 and only gets hub slots when 0 doesn't want them
cognew(diet) -> new cog 9 is diet, paired with cog 1 which is normal. There is no hungry cog in this pair, so the diet one gets promoted to normal.
cognew(diet) -> new cog 2 is diet, unused cog 10 can be hungry, normal, or diet
cognew(diet) -> new cog 10 is diet, paired with cog 2 which is also diet. There's no hungry cog in this pair so they're both promoted to normal
cognew(diet) -> new cog 3 is diet, unused cog 11 can be hungry, normal, or diet
cognew(hungry) -> new cog 11 is hungry, paired with cog 3 which is diet and which only gets hub slots if 11 doesn't want them

OBEX objects should be able to use this without punishing other objects or confusing the programmer.
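The illustration can be reproduced with a short simulation. This is only a sketch: the allocation order (first try to complete a half-used pair, otherwise take the lowest free cog) is an assumption picked because it yields the same cog numbers as the list above, and all names are hypothetical.

```python
NUM_COGS = 16

def partner(cog):
    """Cogs are paired n <-> n+8."""
    return (cog + 8) % NUM_COGS

def compatible(mode, partner_mode):
    """A hungry cog may only share a pair with a diet cog (or an empty seat)."""
    if partner_mode is None:
        return True
    if mode == "hungry":
        return partner_mode == "diet"
    if partner_mode == "hungry":
        return mode == "diet"
    return True

def cognew(cogs, mode):
    """Start a cog in the given mode; returns its number, or None if none fits."""
    free = [c for c in range(NUM_COGS) if cogs[c] is None]
    half_used = [c for c in free if cogs[partner(c)] is not None]
    for c in half_used + [c for c in free if c not in half_used]:
        if compatible(mode, cogs[partner(c)]):
            cogs[c] = mode
            return c
    return None

def effective_mode(cogs, cog):
    """A diet cog is promoted to normal unless its partner is hungry."""
    if cogs[cog] == "diet" and cogs[partner(cog)] != "hungry":
        return "normal"
    return cogs[cog]

cogs = [None] * NUM_COGS
requests = ["hungry", "normal", "diet", "diet", "diet", "diet", "diet", "hungry"]
started = [cognew(cogs, m) for m in requests]
print(started)
print({c: effective_mode(cogs, c) for c in sorted(started)})
```

With this ordering, only cogs 8 and 3 end up actually running in diet mode; the other diet requests are promoted because no hungry cog shares their pair.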

electrodude

RossH · 2014-05-01 21:47

Electrodude wrote: »

How about instead of wasting a whole cog just so another can be in hungry mode, you add an option to be able to use the other cog in the pair but you'll have to acknowledge in the cognew that it's OK if you only get hub slots when the cog's pair doesn't want the slot.

Too complicated, and again introduces non-determinism.

Cluso's model is already used on the P1, where we have (for instance) two cogs co-operating to generate alternate scan lines for a display (from memory, this is in one of Chip's hi-res VGA drivers). But it currently takes a lot of skull-sweat to make this work, whereas Cluso's scheme makes it much easier to do this without the possibility of introducing some non-deterministic "jitter" that would make it unworkable.

Ross.

Electrodude · 2014-05-01 22:13

RossH wrote: »

Too complicated, and again introduces non-determinism.

My method doesn't introduce any indeterminism or jitter in the hungry or normal cogs, and optional indeterminism (only in diet cogs, which you don't get unless you specifically ask for them) is better than wasting a whole cog. Hardware wise, it's not much more complicated than Cluso's method - if the hungry cog in a pair doesn't want a hub cycle, then the diet cog in that pair gets it instead of the cycle going unused. If you don't start any diet cogs, my method would act exactly the same as Cluso's. My method is basically Cluso's, but you can reclaim (with no promises of any hub slots at all) those wasted cogs. If you need determinism, don't put a cog in diet mode, and your cog won't be paired with a hungry cog. Diet mode would mainly be good for things that rarely need the hub or for low-priority things for which it doesn't matter if they spend 90% of their time waiting for the hub as long as they eventually get what they're supposed to do done (like a RTC - it has a whole second to wait for the hub before it has to go into the next waitcnt, it would almost be a miracle if there were no free hub slots in that time).

Obviously, it would be stupid to run diet cogs in hubexec (unless you like running at 1/100 normal speed). It would also be fairly pointless to pair a diet cog with a hubexec'ing hungry cog, as I would be surprised if many hubexec programs EVER don't use hub slots, meaning the diet cog wouldn't ever get any hub slots. Still, it would be nice to have this so it would be possible to reclaim that wasted processing power. Maybe there could be a 4th mode, for hubexec and hungry, to indicate that no diet cogs should be paired with cog because they'll NEVER get a turn, as opposed to only getting them occasionally when paired with a hungry cog that isn't always hungry.

electrodude

RossH · 2014-05-01 22:27

Electrodude wrote: »

Diet mode would mainly be good for things that rarely need the hub or for low-priority things for which it doesn't matter if they spend 90% of their time waiting for the hub as long as they eventually get what they're supposed to do done (like a RTC - it has a whole second to wait for the hub before it has to go into the next waitcnt, it would almost be a miracle if there were no free hub slots in that time).
electrodude

I believe in miracles.

Lawson · 2014-05-01 22:51

Rayman wrote: »

How does one get the result from a 24x24 bit multiply?

Well I'd expect the 128x128 physical layout of cog memory should allow a 24x24 bit multiply to return a 64-bit result. I.e. a "MULB" could return the result in D and D+1. (while a MUL and MULS would just change D)

Marty

P.S. I am strongly against any hub-slot re-mapping. Hub-slot remapping would destroy the simple *effective* hub sharing to potentially gain better utilization of the hub. With quad-reads giving 4 longs every 8 instructions, can cog code even handle data faster than that?

Heater. · 2014-05-01 23:47

Cluso,

...what is the harm if I can write an object that can get 2x slot accesses by taking 2 cogs (paired cogs as in 1 & 9, etc) and by using a slot sharing scheme

That scheme seems perfectly fine to me.

Basically it is bullet proof because you have made the COG pair the "atomic unit" of processing power. That is to say, yes my greedy COG requires some other COG to donate the excess bandwidth, but that's OK because I "own" that other COG, the application sees no effect.

Neat idea.

I imagine that in the extreme the "greedy" COG is running hubexec (do we still have that?) or otherwise requires the extra HUB bandwidth all the time and then the paired COG becomes essentially useless. Whatever code it can run cannot be allowed to use HUB at all.
