Propeller II update - BLOG

Bill Henning · 2013-12-01 12:08

Thanks.

2x improvement for the pair is the "maximum" except for some edge cases; basically using RDxxxC for consecutive hub access would look like single clock hub reads. Biggest impact would be for video and LMM.

I held off posting until I was pretty sure that this way would be non-intrusive, non-obex-busting, as the obex argument was pretty powerful.

Right now, at 160 MHz, the hub bandwidth would be 4 longs every 8 cycles per cog. This would be doubled for one cog in a "pair", and it would be guaranteed.

High-bandwidth apps that pair would be using one object file anyway, so this would be transparent to users.

Non-paired cogs could approach 2x, but due to hub cycle windows would get an increase between 1.1x-2x, non-guaranteed.
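As a rough check on those figures, here is a minimal Python sketch of the arithmetic. It assumes a long is 4 bytes and that a cog's hub window delivers 4 longs once every 8 clocks, as stated above; the function name and parameters are made up for illustration.

```python
# Back-of-envelope hub bandwidth, using the figures quoted in the post.
# Assumption: one long = 4 bytes; all names here are illustrative only.

def hub_longs_per_second(sys_clock_hz, longs_per_window=4,
                         clocks_per_rotation=8, windows=1):
    """Longs per second for a cog that gets `windows` hub windows per rotation."""
    return sys_clock_hz * longs_per_window * windows // clocks_per_rotation

single = hub_longs_per_second(160_000_000)             # own slot only
paired = hub_longs_per_second(160_000_000, windows=2)  # favoured cog of a pair

print(single, "longs/s =", single * 4 // 1_000_000, "MB/s")
print(paired, "longs/s =", paired * 4 // 1_000_000, "MB/s")
```

Under these assumptions a single cog sees 80 M longs/s and the favoured cog of a pair sees 160 M longs/s. A non-paired cog picking up spare slots would land somewhere in between, with no guarantee.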

Heater. wrote: »

Bill,

You may have an acceptable "non-intrusive" solution there. I'm too tired to tell.

This sort of statement inclines me to believe that the performance improvements we could expect from all this "HUB busting" are not so great.

Any idea what gains it actually does get us?

Ahhg...what am I saying. I already decided we are not going to do this.

Bill Henning · 2013-12-01 12:12

I disagree that it causes problems, as cogs that use the 'any available' slot option would know they cannot rely on deterministic hub locked cycles.

The use case for those cogs is low speed drivers, copying memory from hub location to hub location etc.; it would also speed up LMM, Spin and other VMs.

So I'd either use my list, or add

Option 4: use my own slot and any available slot
Option 5: only use freely available slots

I see no reason to limit your option 3 to the twin's slot, as it is already non-deterministic at the microsecond level.

ctwardell wrote: »

I think the "unused slots can be used by any cog" is what causes trouble.
This introduces additional jitter in how long a hub op may take and an object that works when there are a lot of 'free slots' may break when there aren't so many slots.

My proposed solution from post #3082 (it was #3083 at one point...) was more restrictive:

Hub Access Pairs

COG0 and COG4
COG1 and COG5
COG2 and COG6
COG3 and COG7

Option 1: Use 'my' hub slot only
Option 2: Use 'my' hub slot and 'my twin's' hub slot if available
Option 3: Use 'my twin's' hub slot if available

Setting the "high performance" COG to Option 2 gives it guaranteed 2X hub slots, while the "donor" set to Option 3 gets whatever hub slots its twin doesn't use.
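To make the pairing rules concrete, here is a small Python sketch of a slot arbiter for this scheme. It is only one reading of the three options: it assumes an 8-slot rotation with slot n owned by COG n, and it assumes a donor (Option 3) may still fall back to its own slot when its twin leaves it idle, which the post does not spell out. All names are hypothetical.

```python
# Sketch of the three pairing options. Assumptions (not from the post):
# an 8-slot rotation, slot n owned by COG n, and a donor (Option 3) may
# fall back to its own slot if its twin does not claim it.

OWN_ONLY, OWN_PLUS_TWIN, TWIN_ONLY = 1, 2, 3  # Options 1, 2 and 3

def twin(cog):
    """COG0<->COG4, COG1<->COG5, COG2<->COG6, COG3<->COG7."""
    return (cog + 4) % 8

def grant(slot, option, wants_hub):
    """Return the cog that gets hub slot `slot`, or None if it goes unused."""
    owner, other = slot, twin(slot)
    # The owner has first claim unless it has volunteered to be a donor.
    if option[owner] in (OWN_ONLY, OWN_PLUS_TWIN) and wants_hub[owner]:
        return owner
    # Otherwise the twin may take it, if the twin opted in to twin slots.
    if option[other] in (OWN_PLUS_TWIN, TWIN_ONLY) and wants_hub[other]:
        return other
    # Assumed fallback: a donor picks up its own slot if the twin passed.
    if option[owner] == TWIN_ONLY and wants_hub[owner]:
        return owner
    return None

def grants_per_rotation(option, wants_hub):
    """Count how many slots each cog wins in one 8-slot rotation."""
    counts = [0] * 8
    for slot in range(8):
        winner = grant(slot, option, wants_hub)
        if winner is not None:
            counts[winner] += 1
    return counts

option = [OWN_ONLY] * 8
option[0], option[4] = OWN_PLUS_TWIN, TWIN_ONLY  # COG0 fast, COG4 donor

busy = grants_per_rotation(option, [True] * 8)              # every cog wants the hub
idle0 = grants_per_rotation(option, [False] + [True] * 7)   # COG0 stays off the hub
print(busy)
print(idle0)
```

With every cog requesting, COG0 wins two slots per rotation and COG4 wins none; when COG0 stays off the hub, COG4 picks up both. The Option 1 cogs get exactly one slot either way, which is the point of restricting sharing to the twin.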

C.W.

David Betz · 2013-12-01 12:15

Heater. wrote: »

David,

ArrgrgggH....

While AUX RAM is cool and has lots of nice addressing modes, it isn't likely to be useful for a compiled language like C because it's too small. The ability to execute instructions directly from a large memory would be a lot more useful. I'm afraid that most of the new features in P2 will mainly benefit people writing PASM and maybe interpreter writers if they can live within the size constraints of the AUX RAM for stack. They won't likely be very useful for compiled languages. The number of people willing to write code in assembly language is probably pretty small compared with the number who would want to use a language like C so I would think some consideration would be given to improving C performance and I don't think that obscure ways of speeding up LMM really qualifies.

jmg · 2013-12-01 12:17

cgracey wrote: »

It's too bad that we can't have this feature, but keep it hidden from all except those who would use it with full responsibility. Maybe putting the chip into a special mode could open up this feature.

it defaults to off, is that not hidden enough ?
Also, I think the priority scheme you outline needs two consenting COGS, which makes it hard to break accidentally ?
Any OBEX that needs this, will need to launch 2(+?) COGS, with a skeleton in one users can expand on.

cgracey wrote: »

Is there some sort of compromise that could allow this feature to exist, but without encouraging bad behavior? I kind of see it as a tool to employ when an application is done, but you'd like to open the valve all the way to maximize performance.

I see the main benefit in those cases that NEED some minimum bandwidth, so those are likely to start out using this, not add it later
'BW turbo' is only going to be used in rare applications, and certainly not in all COGs.
eg Some Video / USB / Ethernet apps could use this, but even those do have frequent pauses where their BW-demands fall, and other slots can open up.

There are 8 COGs, and in P2 it will be rarer to use all 8, as threading allows more in any one COG.
That means many situations where bandwidth is simply wasted, so even a simple system to boost BW is desirable.

All that said, if I were asked 'what would I work on first, SerDes finalisation or HUB slots?', I would have to put SerDes finalisation ahead of HUB slots.

brucee · 2013-12-01 12:19

I hate to mention it, but while the engineering is the fun side of the design process, market reality is important.

What I mean is, what is the target market, target application? If that were defined a lot of these questions are easy to answer.

It is not possible to design a chip to do everything well. There are always tradeoffs, and this is especially true when you are behind the curve on IC processing. So is the target the hobbyist community that gives maybe simpler programming than a Rpi for graphics, or some fast graphics engine, though without hardware support it will be orders of magnitude slower than game box chips, and desktop decode engines, or some other unnamed application?

To me, Parallax is in the simple/reasonable performance end of the hobbyist market. I just don't see it competing in volume consumer products, and with 8 year rollouts, it is definitely behind.

ctwardell · 2013-12-01 12:20

Bill,

The reason for getting rid of the "any slot", is there has been a lot of concern voiced about people writing objects that only worked because they were taking advantage of "free slots".

I personally am happy with Chip's solution, but I'm afraid of seeing the baby thrown out with the bathwater over fears that this is going to kill the OBEX.

So I'm content with a more strict implementation if that is what it takes to get this feature included.

C.W.

ctwardell · 2013-12-01 12:23

David Betz wrote: »

While AUX RAM is cool and has lots of nice addressing modes...

Don't forget, it's also used as the CLUT for video isn't it?

Calling it AUX RAM just lets people know you can do a lot more with it.

C.W.

Bill Henning · 2013-12-01 12:24

David,

A hardware LMM mode would still be limited with how the hub memory is accessed, and even more limited for executing out of SDRAM (which would still be limited by needing to share it with video, other cogs etc).

The "real" hardware solution is a different kind of cog, with as large as possible, single cycle, multi-line associative mapped caches for data and code, designed specifically for compiled code. An MMU of some sort is also pretty much a requirement. This would make it non-deterministic, but for the glue "business logic" this would not matter.

David Betz wrote: »

While AUX RAM is cool and has lots of nice addressing modes, it isn't likely to be useful for a compiled language like C because it's too small. The ability to execute instructions directly from a large memory would be a lot more useful. I'm afraid that most of the new features in P2 will mainly benefit people writing PASM and maybe interpreter writers if they can live within the size constraints of the AUX RAM for stack. They won't likely be very useful for compiled languages. The number of people willing to write code in assembly language is probably pretty small compared with the number who would want to use a language like C so I would think some consideration would be given to improving C performance and I don't think that obscure ways of speeding up LMM really qualifies.

Bill Henning · 2013-12-01 12:28

ct,

I understand, but the scheme I last posted addresses that concern - it does not hurt to use the 'any slot' if a cog wants to, as if it needs strict 1/8 cycle determinism it can simply not use that mode.

If it uses the 'deterministic'+'any' mode, it speeds up LMM, memory copying etc, without impacting other cogs.

A win-win scenario.

By pairing the high bandwidth cogs, which belong in a single object anyway, the issue of concern becomes moot.

Actually, if there was a COGINITPAIR, it would even make it more readable :-)

ctwardell wrote: »

Bill,

The reason for getting rid of the "any slot", is there has been a lot of concern voiced about people writing objects that only worked because they were taking advantage of "free slots".

I personally am happy with Chip's solution, but I'm afraid of seeing the baby thrown out with the bathwater over fears that this is going to kill the OBEX.

So I'm content with a more strict implementation if that is what it takes to get this feature included.

C.W.

Bill Henning · 2013-12-01 12:31

Most compiled C code on small microcontrollers has a small stack.

Unless deep recursion is involved it will be VERY useful - as functions with a limited number of arguments and locals can access their locals and arguments with a single instruction, in a single cycle.

I suspect I will be using the AUX addressing modes extensively.

Don't forget, by removing the DAC ring bus, Chip has freed up 20% of the die's real estate; that would allow a LOT more AUX memory, which would be easy to implement.

It would be quite difficult to add more cog memory - and besides, due to dual porting (vs. quad port) we can get 2x as much AUX memory as could be added to the cog for the same transistor budget.

David Betz wrote: »

While AUX RAM is cool and has lots of nice addressing modes, it isn't likely to be useful for a compiled language like C because it's too small. The ability to execute instructions directly from a large memory would be a lot more useful. I'm afraid that most of the new features in P2 will mainly benefit people writing PASM and maybe interpreter writers if they can live within the size constraints of the AUX RAM for stack. They won't likely be very useful for compiled languages. The number of people willing to write code in assembly language is probably pretty small compared with the number who would want to use a language like C so I would think some consideration would be given to improving C performance and I don't think that obscure ways of speeding up LMM really qualifies.

jmg · 2013-12-01 12:36

ctwardell wrote: »

The reason for getting rid of the "any slot", is there has been a lot of concern voiced about people writing objects that only worked because they were taking advantage of "free slots".

This could be fixed by "any slot" => "any consenting slot" ?

That means any OBEX 'dropped in', would get the slots it expected, unless it was modified to consent to a lower access rate.

The inference is, that does not fall to zero access.

Alternative means of Consenting:
One way to manage that, would be a 9th slot, that is managed on a simple round robin. BW is lower (fSys/72?), but it is always available at that ~2Ma rate. (which happens to be about a present P1 hub rate, flat out ? )
COGS that choose that, can 'gift' another 20Ma each, to a hungry COG.

This is simple to understand, and still gives a serious and deterministic bandwidth.
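The rates quoted above can be checked with a few lines of Python. This is only a sketch: it assumes a 160 MHz system clock, and the reading of fSys/72 as "9 slots times 8 cogs" is a guess that the post itself does not spell out.

```python
# Access rates behind the "9th slot" idea, at an assumed 160 MHz system clock.
# Assumption: fSys/72 is read here as 9 slots * 8 cogs = 72 clocks between a
# given cog's turns at the shared slot; the post only gives the fSys/72 figure.

F_SYS = 160_000_000

def accesses_per_second(f_sys, clocks_between_accesses):
    """Hub accesses per second when one access is possible every N clocks."""
    return f_sys / clocks_between_accesses

dedicated = accesses_per_second(F_SYS, 8)    # a normal per-cog slot (fSys/8)
shared_9th = accesses_per_second(F_SYS, 72)  # a cog's share of the 9th slot

print(f"dedicated slot:  {dedicated / 1e6:.1f} M accesses/s")
print(f"shared 9th slot: {shared_9th / 1e6:.2f} M accesses/s")
```

fSys/8 gives the 20 Ma a yielding cog could gift, and fSys/72 comes out a little over 2 Ma, in line with the "~2Ma" figure.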

Bill Henning · 2013-12-01 12:37

Chip,

You must have missed my earlier questions about the transistor budget... not surprising in this blizzard of postings!

The 20% of die freed up - about how many transistors does that free up?

1) how many transistors does 32 bits of HUB memory (single ported) take?

2) how many transistors does 32 bits of AUX memory (dual ported) take?

3) how many transistors does 32 bits of COG memory (quad ported?) take?

4) how many transistors would 32 bits of eight ported memory take?

Thanks,

Bill

ctwardell · 2013-12-01 12:37

Bill,

Your "Option 4: use my own slot and any available slot", and Chip's that included "use any" are the ones that seem to cause concern.

The stated fear is that someone, a noob in particular, will create an object that only worked because it was making use of its own slots + free slots.
That user then makes some changes in another object that cause fewer free slots to be available and now they want to know why their object broke.
They now hate the propeller and become an ardent Arduino supporter...

I think a key point that may be getting in the way of us understanding each other is that the noob is not aware he is making use of those extra cycles.

C.W.

jmg · 2013-12-01 12:39

Bill Henning wrote: »

Actually, if there was a COGINITPAIR, it would even make it more readable :-)

Yes, see my comment in #3215 of COG-pair launch.

jmg · 2013-12-01 12:42

ctwardell wrote: »

Your "Option 4: use my own slot and any available slot", and Chip's that included "use any" are the ones that seem to cause concern.

What if the semantics changes to
"Option 4: use my own slot and any available consenting slot" ?

Bill Henning · 2013-12-01 12:44

I think we agree...

(but my eyes are starting to glaze over from this thread)

jmg wrote: »

Yes, see my comment in #3215 of COG-pair launch.

ctwardell · 2013-12-01 12:53

jmg wrote: »

What if the semantics changes to
"Option 4: use my own slot and any available consenting slot" ?

Not really. The issue isn't on the granting consent side, we aren't stealing anyone's hub slot.

The issue is writing an object that unknown to the author was only working because it was using extra slots, and maybe specific extra slots.

For example, maybe the object works fine when it is seeing free slots that are 4 slots away from its own, but it fails when the free slots are just 1 or 2 slots away from its own.
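A short Python sketch of why the position of the extra slot matters, not just the count. It assumes an 8-slot rotation with one slot per clock and a cog that can use its own slot 0 plus one extra slot; the function name is hypothetical.

```python
# Why *which* extra slots are free matters, not just how many.
# Assumption: an 8-slot rotation, one slot per clock, and a cog that can
# use its own slot 0 plus one extra slot at a fixed offset.

def worst_case_wait(usable_slots, rotation=8):
    """Largest number of clocks between consecutive usable slots."""
    slots = sorted({s % rotation for s in usable_slots})
    gaps = [
        # Distance to the next usable slot, wrapping around the rotation.
        # A lone slot has a gap of one full rotation.
        (slots[(i + 1) % len(slots)] - s) % rotation or rotation
        for i, s in enumerate(slots)
    ]
    return max(gaps)

for offset in (1, 2, 4):
    gap = worst_case_wait([0, offset])
    print(f"own slot + slot {offset} away: worst gap {gap} clocks")
print("own slot only: worst gap", worst_case_wait([0]), "clocks")
```

In all three two-slot cases the cog gets twice the slots, but the longest gap between usable slots is 4 clocks when the extra slot is 4 away and 7 clocks when it is right next door. Code tuned against the evenly spaced case could plausibly miss its timing in the other.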

Chip's solution lets you say just use my own plus slots from a specific other COG -OR- from any COG.

C.W.

Bill Henning · 2013-12-01 12:56

cw,

Sorry, I guess I replied too quickly to your post.

I now better understand your concern.

I tried to address that in my post #3206, where I had four "modes" for a cog:

00 - normal, P1 style round robin - but unused hub cycles are available to any cog
01 - "giving" cog of a cog pair (it volunteers its hub slot to its pair) - unused slots can be used by any cog
10 - "greedy" cog of a cog pair (first claim to unused pair-cog hub cycles) - unused slots can be used by any cog
11 - low bandwidth cog, only uses totally unclaimed cog hub cycles

Notice that there is no unlimited mode for "use my slot, and any free slot", which is I think what you are worrying about.

The default would be hub mode 00, which behaves exactly like P1 (except unused slots are available to other cogs)

01/10 would be a high performance cog pair, newbies would not be writing these - and in effect, to other cogs, it looks like "business as usual", however a few spare hub cycles may be available to cogs that can use it.

11 - low bandwidth cog - is I think the only one you would be concerned with. In this case, a cog would freely give up its regular slot (but could use it if no other cog claimed it) and use the table scraps of other cogs. I see this as being used for low bandwidth drivers (PS/2, low speed serial etc). I really doubt a newbie would write a low bandwidth cog.

I think we are looking at this from a different point of view. I see the "spare" hub cycles as a bonus a cog may get, but may not depend on.

The paired mode is the only one that guarantees more bandwidth, and newbies would not know enough to write these objects. Newbies could use them, as they would not depend on more than the pairs total bandwidth.

The low bandwidth mode again is a special case, the table scraps mode - again newbies would not be writing these, but could use them. The high bandwidth paired mode will still leave table scraps, so will mode 00 cogs - so it would be close to impossible to starve even an 11 cog.

So if all cogs start in HUBMODE %00, newbies etc can write without fear

Only experts would use the paired mode %01/%10, and the low bandwidth mode %11

edit: Note, only mode 01/10/11 cogs could use the spare cycles from a mode 00 cog, so a newbie could not accidentally depend on getting extra cycles for it.
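For anyone who wants to play with the four modes, here is a rough Python sketch of an arbiter. The priority order and the tie-break for spare slots (next requesting cog in rotation order) are assumptions chosen for illustration, as is the COG n / COG n+4 pairing borrowed from C.W.'s post; the names are hypothetical.

```python
# Rough model of the four hub modes. Assumptions (not from the post):
# slot n is owned by COG n, pairs are COG n / COG n+4, and a spare slot
# goes to the next requesting non-00 cog in rotation order.

NORMAL, GIVING, GREEDY, LOW = 0b00, 0b01, 0b10, 0b11

def grant(slot, mode, wants):
    """Return the cog that gets hub slot `slot`, or None if it goes unused."""
    owner, pair = slot, (slot + 4) % 8
    if mode[owner] in (NORMAL, GREEDY) and wants[owner]:
        return owner                      # owner keeps its own slot
    if mode[pair] == GREEDY and wants[pair]:
        return pair                       # greedy cog has first claim
    if mode[owner] == GIVING and wants[owner]:
        return owner                      # giver uses it if its pair passed
    for k in range(1, 8):                 # spare slot: offer it around
        cog = (slot + k) % 8
        if mode[cog] != NORMAL and wants[cog]:
            return cog                    # mode 00 cogs never take spares
    if mode[owner] == LOW and wants[owner]:
        return owner                      # 11 cog gets its slot last
    return None

def grants_per_rotation(mode, wants):
    """Count how many slots each cog wins in one 8-slot rotation."""
    counts = [0] * 8
    for slot in range(8):
        winner = grant(slot, mode, wants)
        if winner is not None:
            counts[winner] += 1
    return counts

mode = [NORMAL] * 8
mode[0], mode[4], mode[7] = GREEDY, GIVING, LOW

all_busy = grants_per_rotation(mode, [True] * 8)
quiet = grants_per_rotation(mode, [False, True, True, False,
                                   False, False, False, True])
print(all_busy)
print(quiet)
```

In both runs the mode 00 cogs that request the hub get exactly one slot per rotation, no more and no less, so nothing written against mode 00 can come to rely on spare cycles. Note that under this particular tie-break the 11 cog gets nothing when every other cog requests every slot; real cogs rarely do that, which is the "table scraps" argument.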

ctwardell wrote: »

Bill,

Your "Option 4: use my own slot and any available slot", and Chip's that included "use any" are the ones that seem to cause concern.

The stated fear is that someone, a noob in particular, will create an object that only worked because it was making use of its own slots + free slots.
That user then makes some changes in another object that cause fewer free slots to be available and now they want to know why their object broke.
They now hate the propeller and become an ardent Arduino supporter...

I think a key point that may be getting in the way of us understanding each other is that the noob is not aware he is making use of those extra cycles.

C.W.

pedward · 2013-12-01 12:57

Bill Henning wrote: »

Chip,

You must have missed my earlier questions about the transistor budget... not surprising in this blizzard of postings!

The 20% of die freed up - about how many transistors does that free up?

4) how many transistors would 32 bits of eight ported memory take?

Thanks,

Bill

Port D

jmg · 2013-12-01 13:06

ctwardell wrote: »

Not really. The issue isn't on the granting consent side, we aren't stealing anyones hub slot.

The issue is writing an object that unknown to the author was only working because it was using extra slots, and maybe specific extra slots.

For example, maybe the object works fine when it is seeing free slots that are 4 slots away from its own, but it fails when the free slots are just 1 or 2 slots away from its own.

I'm not quite following this, because to trigger your example change, the user has to change the COG consenting pair.
So it is not really a 'surprise' if they change something labeled 'This is important' and it breaks ?

potatohead · 2013-12-01 13:07

The end game on pairing is a 4 COG chip.

ctwardell · 2013-12-01 13:14

jmg wrote: »

I'm not quite following this, because to trigger your example change, the user has to change the COG consenting pair.
So it is not really a 'surprise' if they change something labeled 'This is important' and it breaks ?

Yeah, we seem to be miscommunicating somewhere.

I agree, what you call consenting slots are not an issue.

The issue is 'promiscuous slots' where we say that a COG can use ANY slot not being used by another COG.

Like Bill mentioned above, the 'promiscuous slots' can be useful for background task type things, they just seem to be the ones that have some people scared and wanting to kill this.

C.W.

jmg · 2013-12-01 13:16

Bill Henning wrote: »

01/10 would be a high performance cog pair, newbies would not be writing these - and in effect, to other cogs, it looks like "business as usual", however a few spare hub cycles may be available to cogs that can use it.

Agreed

Bill Henning wrote: »

11 - low bandwidth cog - a cog would freely give up its regular slot (but could use it if no other cog claimed it) and use the table scraps of other cogs. I see this as being used for low bandwidth drivers (PS/2, low speed serial etc). I really doubt a newbie would write a low bandwidth cog.

Another way to manage the low bandwidth cog, is to add a moderate bandwidth 9th slot, round robin available,
See my #3222

This guarantees a no interactions minimum access bandwidth, of ~ fSYS/72, and each COG that consents to use this moderate bandwidth path, can gift another 20Ma to a "greedy" cog.
Another advantage is that moderate bandwidth path, is still about the same as a P1 offers, so it is a porting candidate.

ctwardell · 2013-12-01 13:20

We are in agreement Bill about how it 'should' work, I'm just looking for ways to set it up that help alleviate the fears that it is going to cause havoc.

I stated earlier that "With power comes responsibility to use it wisely, I think we are up to the task." and was told that attitude leads to the road to ruin...

C.W.

potatohead · 2013-12-01 13:21

It's not about us. That is the key realization to work through here.

jmg · 2013-12-01 13:23

ctwardell wrote: »

Yeah, we seem to be miscommunicating somewhere.

I agree, what you call consenting slots are not an issue.

The issue is 'promiscuous slots' where we say that a COG can use ANY slot not being used by another COG.

Like Bill mentioned above, the 'promiscuous slots' can be useful for background task type things, they just seem to be the ones that have some people scared and wanting to kill this.

- but there are no 'promiscuous slots', if it can only get what another COG yields.

It cannot accidentally find itself with all empty slots on a test bench, that has to be deliberately coded in.

An option of a 9th slot adds a partial yield, where the yielding COG still has a defined minimum bandwidth available.
(that also means code does not need to thrash permissions, but can pretty much set and forget )

potatohead · 2013-12-01 13:24

So I've got two objects that I really need, and they both make the assumption that the "extra" 9th slot cycles are there. What happens then?

Cluso99 · 2013-12-01 13:29

Better remove the multi-tasking. It's way too complex for most users.

Scenario:
We have cars that can go 200+mph. But the absolute limit in the USA is 55mph (not sure because I am in OZ, so let's use it).
So we best limit the car to 55mph.
Ah but the safety conscious say that in some places the limit is 25mph.
So we best limit the car to 25mph, so it can never be driven over the lowest speed limit.
But we have racing drivers who want to race on the race track, but now they cannot.

This is the same with the P2 and hub slots. Do you really want to limit what it is capable of ???
I don't !!!

If I want to write a nice video driver that can utilise the spare time to its advantage, I will fetch a line of data as quickly as possible from hub to aux (clut). I can then go off and do other things within the same cog. In a hires app, I will no doubt have another cog doing the updating to the video buffer in hub (maybe in sdram). I can do this while yielding to the video generator when needed by donating my hub slot when required. These would not be using other spare slots because we want to keep determinism/timing without compromise. They may be placed in the OBEX as a "pair" of cooperating programs. Should I/we be prevented from doing this because it is too complex for the hobbyists or junior engineers?

Almost all prop programs have one main cog program, and lots of little driver programs. Should the main program be held back by not using spare cogs, no matter how many go unused? Just because it slows down if there is a lot of activity is not a reason to deprive it of going fast when there is not.

I always want to push the envelope - always have, always will.
So with that in mind, it's my 2c.

ctwardell · 2013-12-01 13:30

jmg wrote: »

- but there are no 'promiscuous slots', if it can only get what another COG yields.

It cannot accidentally find itself with all empty slots on a test bench, that has to be deliberately coded in.

An option of a 9th slot adds a partial yield, where the yielding COG still has a defined minimum bandwidth available.
(that also means code does not need to thrash permissions, but can pretty much set and forget )

The case I am calling 'promiscuous' are the slots yielded to 'any' instead of a specific COG.

I'm not feeling the love for the ninth slot, that really does preclude leaving things at a default and having things behave "P1 Style".

C.W.

jmg · 2013-12-01 13:31

potatohead wrote: »

So I've got two objects that I really need, and they both make the assumption that the "extra" cycles are there. What happens then?

They cannot simply "make the assumption", they need a consenting pair (or more), to grant them the extra cycles.

Where this gets murky, is what if the consenting pair wants some rare hub access ?
If they use their slot, they risk tripping the hungry cog, but could do this on a SW permission handshake.

Or can it rely on the group lottery of 'unused slots' to meet that, or should it have another deterministic moderate bandwidth choice ?
