P16+X32B - what might you want as a minimum ?

Bill Henning · 2014-04-07 08:31

I think COGRUN is the new P2 name for COGINIT?

And COGRUNX is P2 for start new cog in HUBEXEC mode?

Thanks... sounds like we can find the space for the new opcodes if we need to (and Chip wants them)

Cluso99 wrote: »

Opcode space for opcode 000011 only uses bits 2:0 of S for CLKSET/COGID/COGINIT/COGSTOP/LOCKxxxx.

Remind me, why do we need COGRUN ? (senior moment or more coffee needed)

potatohead · 2014-04-07 08:39

I think you are right Bill. It was one or the other.

tonyp12 · 2014-04-07 10:13

Fire is not likely, unless many design mistakes add up, like having a non-fire-retardant plastic enclosure too close to the chip etc.
But being able to kill the chip purely in software, even on a standard Parallax dev board, is a problem,
unless the specs for the P1+ require that the voltage regulator is also a current regulator etc., but that will introduce weird behavior at the upper range.
Having the P1+ end up on the wiki's Killer poke page cannot be good.

But I base this on the assumption that a 32-cog P1+ will push the boundaries with wrongly implemented software; if it ends up as a 16-cog part I think there is no need to worry.

SRLM · 2014-04-07 10:44

We need a better way of prioritizing features. The forums have turned out to be a pretty poor collaboration tool.

Maybe for P3, if Parallax still wants to go with community feedback, they can invest in some good collaboration tools. For this particular type of scenario, I'm thinking it would be helpful to have a "feature board" that lists all the current and proposed features, with voting. That makes it easier to see what people actually want.

Bill Henning · 2014-04-07 11:37

Tony,

The necessary voltage regulator for vCore will thermally shut down if too much current is drawn.

You can easily kill the chip - especially the pll's.

One or two thermal monitor diodes, and a cog monitoring them is safe.

Random reset / shutdown causes liability issues for Parallax.

tonyp12 wrote: »

Fire is not likely, unless many design mistakes add up, like having a non-fire-retardant plastic enclosure too close to the chip etc.
But being able to kill the chip purely in software, even on a standard Parallax dev board, is a problem,
unless the specs for the P1+ require that the voltage regulator is also a current regulator etc., but that will introduce weird behavior at the upper range.
Having the P1+ end up on the wiki's Killer poke page cannot be good.

But I base this on the assumption that a 32-cog P1+ will push the boundaries with wrongly implemented software; if it ends up as a 16-cog part I think there is no need to worry.

jmg · 2014-04-07 12:39

Bill Henning wrote: »

I am 100% convinced hub slot allocation must be kept separate from cog clock gating. See above. The cog should control its own clock gating, as only the code in it knows how many clock cycles it needs... and dropping below the minimum it needs could cause huge problems.

The issue I saw was along the line of

You set a CLK divide of 1/7, as giving the MOPS/Power Envelope you need
You allocate one in every 128 as HUB

Now run both those 'dividers' , and watch what happens to your expected HUB BW and your expected 1/7 power as those dividers sync, and then fall out of sync.
The CLK_Divide is a Clock enable, when it is off, the COG sees nothing.
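
One way to see the interaction described here is to step through clocks and count how often an enabled clock coincides with the cog's hub slot. The following is only a Python sketch: the model (a clock enable that is high on every 7th clock, a hub slot on every 128th, both as free-running counters starting together) and the function names are assumptions for illustration, not the actual P1+ logic.

```python
from math import gcd

def usable_hub_slots(clk_div, hub_div, clocks):
    """Count clocks where the cog is both clock-enabled and owns the hub slot.

    Assumed model: the clock enable is high when t % clk_div == 0 and the
    cog owns the hub when t % hub_div == 0.
    """
    return sum(1 for t in range(clocks)
               if t % clk_div == 0 and t % hub_div == 0)

def coincidence_period(clk_div, hub_div):
    """Clocks between coincidences of the two dividers (least common multiple)."""
    return clk_div * hub_div // gcd(clk_div, hub_div)

period = coincidence_period(7, 128)                  # clocks until both line up again
slots_offered = period // 128                        # hub slots allotted in that window
slots_usable = usable_hub_slots(7, 128, period)      # slots that land on an enabled clock
print(period, slots_offered, slots_usable)
```

Under this model the two dividers only line up once every 896 clocks, so the cog can use one of the seven hub slots it was allotted in that window.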

Bill Henning · 2014-04-07 13:40

caveat developer

I see what you mean, but to the 5 bit cogid for the slot map, you would need to add 16 clock gating bits on a 16 cog part, and 32 on a 32 core part - which won't fit. Such an arrangement also just as easily causes problems by inadvertently starving said cogs of needed cog cycles.

jmg, you and I tend to see eye to eye on most technical issues, I think our differences are usually because:

- you prefer to add extra logic to try to protect the developer from ... ill considered mistakes and not reading the data sheet ...

- I think if they get themselves into trouble from ill considered mistakes, they will learn from them, and become better developers

The school of hard knocks so to speak

jmg wrote: »

The issue I saw was along the line of

You set a CLK divide of 1/7, as giving the MOPS you need
You allocate one in every 128 as HUB

Now run both those 'dividers' , and watch what happens to your expected HUB BW and your expected 1/7 power as those dividers sync, and then fall out of sync.
These are Clock enables, when one is not fed, the COG sees nothing.

jmg · 2014-04-07 13:59

Bill Henning wrote: »

I see what you mean, but to the 5 bit cogid for the slot map, you would need to add 16 clock gating bits on a 16 cog part, and 32 on a 32 core part - which won't fit.

?? What do you mean it won't fit, ? - it is just a single small RAM. it can be any form factor you like.

Bill Henning wrote: »

jmg, you and I tend to see eye to eye on most technical issues, I think our differences are usually because:

- you prefer to add extra logic to try to protect the developer from ... ill considered mistakes and not reading the data sheet ...
- I think if they get themselves into trouble from ill considered mistakes, they will learn from them, and become better developers

It is the nasty fish hooks I try to avoid, whilst trying to keep the explanation of how it works, as simple as possible, and keeping things clearly deterministic.

( A mapping array even allows a 'traveling wave' approach to COG data and power.)

I think there is also a need for a Wrap (modulus) control on these mappings, to better match however many COGs the design has alive.
This allows jitter to be eliminated, but a separated HUB / Counter scheme does not support that.

Counter is 1/M, on power, whilst the bit-array gives P/Q power control.

The picket-array approach needs more 'bits', but probably similar(even smaller?) die area, as is all in ONE compact RAM, not 16 copies of FF & Counters.

Bill Henning · 2014-04-07 14:14

jmg wrote: »

?? What do you mean it won't fit, ? - it is just a single small RAM. it can be any form factor you like.

Sorry. I was not clear.

16+5 bits per slot, for a table indexed by X low bits of cnt would of course fit.

I am still hoping that if there is enough room at the end, once Chip finishes tweaking the cogs to his liking, we will get one of:

- a significant increase in hub (ie a nice big chunk, 128KB++)... a 640KB P1+ would be hilarious....

"nobody will ever need more than 640Kb" ... was that Gates?

Anything less than 128KB additional hub, I'd far rather go to 32 cogs... oh the luxury of SIMPLE drivers! (or as many as will fit, in addition to 16, since I can't get P2 cogs)

32+5 bits per array entry does not fit in a long. If a 128 entry hub slot array fits, it would be great for precise deterministic bandwidth control!

That is what I meant.
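
The bit counts above can be checked with a few lines of Python. This is a sketch only: it assumes each table entry holds a 5-bit cog id plus one clock-gating bit per cog, as described in the post, and the helper names are made up for illustration.

```python
LONG_BITS = 32  # a Propeller long is 32 bits wide

def entry_bits(num_cogs, cogid_bits=5):
    """Bits per table entry: a cog id plus one clock-gating bit per cog."""
    return cogid_bits + num_cogs

def fits_in_long(num_cogs):
    """True when one table entry fits in a single 32-bit long."""
    return entry_bits(num_cogs) <= LONG_BITS

for cogs in (16, 32):
    print(cogs, "cogs:", entry_bits(cogs), "bits, fits in a long:", fits_in_long(cogs))
```

With 16 cogs an entry is 21 bits and fits; with 32 cogs it is 37 bits and does not.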

jmg wrote: »

It is the nasty fish hooks I try to avoid, whilst trying to keep the explanation of how it works, as simple as possible, and keeping things clearly deterministic.

( A mapping array even allows a 'traveling wave' approach to COG data and power.)

I think there is also a need for a Wrap (modulus) control on these mappings, to better match however many COGs the design has alive.
This allows jitter to be eliminated, but a separated HUB / Counter scheme does not support that.

Counter is 1/M, on power, whilst the bit-array gives P/Q power control.

The picket-array approach needs more 'bits', but probably similar(even smaller?) die area, as is all in ONE compact RAM, not 16 copies of FF & Counters.

Strongly disagree with modulus, better to have it always be a deterministic 1/128 increment.

Consider object designed for 1/100 slices mixed with object designed for 1/128 slice with one designed for 1 in 30.

IMHO KISS principle applies here.

I think the best approach is a separate small divider - say 3 bit counter - per cog, used to gate the clock. Small cdiv[num_clocks] table, keeps it separate from hub slot assignment, not too many gates.

Having said that, I am not sure that is needed at all, with the pipeline stall current being somewhere between quiescent current, and 10% of typical (I think your estimate?) as an unused cog can simply be stopped.

jmg · 2014-04-07 14:37

Bill Henning wrote: »

Strongly disagree with modulus, better to have it always be a deterministic 1/128 increment.

Err, but that ignores jitter, and forces designers into a modulus 128 straitjacket.
Why give users (almost) full control, but then cripple it at the last minute ?

If you really want to protect people, then they need not change the modulus, but do not remove that tool from an experienced designer (who you claimed earlier to back, over the novice ?)

Bill Henning wrote: »

IMHO KISS principle applies here.

KISS that adds jitter, or removes pgmr control, is not KISS at all, it is KILL

Bill Henning · 2014-04-07 14:49

jmg,

Known, predictable jitter, same for all cogs

vs.

Obex objects not having a clue as to what their allocation will be.

Object should state minimum required hub bandwidth, if developer does not know that figure, why is he developing?

Obj #1: wants 1/30 --> give it 4/128 or 5/128 depending on required minimum (easy for automatic tool), for serial / ps2 kb / mouse / most low speed devices jitter for hub access is 100% irrelevant and often 1/128 is too much for them

Obj #2: wants 1/16 --> give it 8/128

Obj #3: wants 1/11 --> give it 11/128 or 12/128 depending on required minimum (easy for automatic tool)

Obj #4: wants 32/128 for video or high speed sampling/output pins
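
The rounding an automatic tool would do can be sketched in Python as follows. The function name and the floor/ceiling policy are assumptions for illustration; the post only says the choice depends on the object's required minimum. Note that 128/11 is about 11.6, so a 1/11 request rounds to 11 or 12 slots.

```python
from fractions import Fraction
from math import floor, ceil

TABLE_LEN = 128  # fixed slot-table length argued for in the post

def slot_options(requested, table_len=TABLE_LEN):
    """Return (rounded-down, rounded-up) slot counts for a requested share of the hub."""
    exact = requested * table_len
    return max(1, floor(exact)), max(1, ceil(exact))

requests = [("1/30", Fraction(1, 30)), ("1/16", Fraction(1, 16)),
            ("1/11", Fraction(1, 11)), ("32/128", Fraction(32, 128))]
for name, want in requests:
    low, high = slot_options(want)
    print(f"wants {name}: {low}/{TABLE_LEN} or {high}/{TABLE_LEN}")
```

Rounding up guarantees the requested bandwidth; rounding down is only safe when the object tolerates slightly less than it asked for.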

Now an object demands setting the modulus to 1/100

BOOM. VGA driver or sampling fails big time because it was designed for 32/128 locked to the hub.

A slight jitter, for the low bandwidth cogs, is (IMHO) insignificant

Fixed table size, at a multiple of number of cogs (strongly recommend cogs x 4) makes high bandwidth cogs fully deterministic - and hitting hub access sweet spots is critical for those. Getting jitter on those breaks them.

Conclusion:

1) modulus breaks high bandwidth drivers

2) small jitter on low bandwidth cogs is irrelevant, they can use waitcnt, and most likely will, instead of locking to the hub

jmg wrote: »

Err, but that ignores jitter, and forces designers into a modulus 128 straitjacket.
Why give users (almost) full control, but then cripple it at the last minute ?

If you really want to protect people, then they need not change the modulus, but do not remove that tool from an experienced designer (who you claimed earlier to back, over the novice ?)

KISS that adds jitter, or removes pgmr control, is not KISS at all, it is KILL

jmg · 2014-04-07 15:05

Bill Henning wrote: »

jmg,
Known, predictable jitter, same for all cogs

Err, I'd prefer having NO jitter, and things under MY control.

Your examples are simply satisfied with a value in the Modulus setting, but you cannot solve my jitter problem in your 128 straitjacket.

I'm not following your contradictory stance, you earlier claimed caveat developer, but now conspire to remove choice & control from the developer ? The silicon cost of giving the designer control is tiny.
Your case is a subset, a chip can do both.

tonyp12 · 2014-04-07 15:32

You have a few cogs using values based on 1/128, as you want results to be completed at the same time interval.
What will happen if a separate cog uses 1/100? Now and then it will fight for the same time slot as one of the above.

Maybe the code for that driver came from obex with the 1/100 value as its default, so the simple fix is to change it so that all your drivers use 1/128.
So jmg is right: if there is no need to hardcode a value, why not let the user set it? Even using a modulus, the result can still be 1/32, 1/64 and 1/128 round robin.

Bill Henning · 2014-04-07 15:35

I think we are having a communications issue, so NAK

My four object example is extremely clear (I thought)

Allow me to re-state.

A high-bandwidth cog needs 32/128 or 64/128 cycles precisely. As it is high bandwidth, and hub locked, its needs are the most important.

Low bandwidth cogs, the jitter does not matter. Forgive me, you have not given any examples where it matters.

Example:

1Mbps serial. 100,000 bytes per second. Given 1/128 hub slots, it gets 1,562,000 hub cycles every second. It will use less than 1/15th of those. Jitter is irrelevant.

10Mbps serial. 1,000,000 bytes per second. Given 1/128 hub slots, it gets 1,562,000 hub cycles every second. It will use 2/3 of those. Jitter is irrelevant, but anyone who worries can configure it for 2/128.

20Mbps serial. 2,000,000 bytes per second. Needs 2/128 hub slots, which gives it 3,124,000 hub cycles every second. Give it 3/128. Jitter is irrelevant.
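
The figures in these examples can be reproduced with a short Python sketch. The 200MHz clock is an assumption inferred from the quoted 1,562,000 figure (200,000,000 / 128 = 1,562,500); the sketch also assumes one hub slot per clock and one hub access per byte.

```python
CLOCK_HZ = 200_000_000  # assumed clock; 200e6 / 128 = 1,562,500
TABLE_LEN = 128         # slot-table length used in the examples

def hub_cycles_per_second(slots, clock_hz=CLOCK_HZ, table_len=TABLE_LEN):
    """Hub cycles per second for a cog that owns `slots` entries of the table."""
    return clock_hz * slots // table_len

def utilisation(bytes_per_second, slots):
    """Share of the cog's hub cycles consumed, assuming one hub access per byte."""
    return bytes_per_second / hub_cycles_per_second(slots)

cases = [("1Mbps", 100_000, 1), ("10Mbps", 1_000_000, 1), ("10Mbps", 1_000_000, 2),
         ("20Mbps", 2_000_000, 2), ("20Mbps", 2_000_000, 3)]
for label, rate, slots in cases:
    print(f"{label} at {slots}/{TABLE_LEN}: "
          f"{hub_cycles_per_second(slots):,} hub cycles/s, "
          f"{utilisation(rate, slots):.1%} used")
```

On these assumptions the 1Mbps driver uses about 6% of a single slot's hub cycles and the 10Mbps driver about 64%, which is the headroom the argument relies on.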

jmg,

I am not trying to be a pain. I like learning.

Please give similar examples where the jitter to hub access, for low speed drivers, matters.

jmg wrote: »

Err, I'd prefer having NO jitter, and things under MY control.

Your examples are simply satisfied with a value in the Modulus setting, but you cannot solve my jitter problem in your 128 straitjacket.

My stance is not contradictory, I simply think the modulus is an unnecessary complication that is likely to cause significant problems for high bandwidth cogs, and ruins their determinism.

Please see my examples, I can be brought around to your point of view if modulus is the lesser evil than jitter.

Right now, I believe that modulus is the greater evil. Convince me otherwise.

I'm not following your contradictory stance, you earlier claimed caveat developer, but now conspire to remove choice & control from the developer ? The silicon cost of giving the designer control is tiny.
Your case is a subset, a chip can do both.

Bill Henning · 2014-04-07 15:37

please see my message to jmg.

Having a fixed 128 entry table gets rid of the need for modulus, and changing modulus would cause significant problems for high bandwidth cogs, where jitter can be deadly (not writing video on time, not sampling at the correct rate etc)

Low bandwidth cogs don't care about the jitter

We CANNOT mix/match objects that expect a different modulus in the table, that would make Obex unusable.

tonyp12 wrote: »

You have a few cogs using values based on 1/128, as you want results to be completed at the same time interval.
What will happen if a separate cog uses 1/100? Now and then it will fight for the same time slot as one of the above.

Maybe the code for that driver came from obex with the 1/100 value as its default, so the simple fix is to change it so that all your drivers use 1/128.
So jmg is right: if there is no need to hardcode a value, why not let the user set it? Even using a modulus, the result can still be 1/32, 1/64 and 1/128 round robin.

tonyp12 · 2014-04-07 15:57

Bill is right too, if some obex code has high-bandwidth cogs that only work with a 16/100 modulus setting but other obex code only works at 32/128.
How would a person use both? To get an obex standard, maybe forcing them all to 1/128 is the only way to go?

jmg · 2014-04-07 16:13

tonyp12 wrote: »

Bill is right too, if some obex code has high-bandwidth cogs that only work with a 16/100 modulus setting but other obex code only works at 32/128.
How would a person use both? To get an obex standard, maybe forcing them all to 1/128 is the only way to go?

Sure, I have no issues with there being OBEX suggestions or frameworks.

What I do have an issue with, is being forced into a straitjacket, 'just because', when there is an easy superset solution (ie supports both design paths) available, at negligible die cost.

jmg · 2014-04-07 16:20

Bill Henning wrote: »

Low bandwidth cogs, the jitter does not matter.

The prop is famous for being jitter free - as soon as you admit there is a jitter risk in your design, any claims that you think it does not really matter in XYZ case are moot.

Bill Henning · 2014-04-07 16:43

jmg,

If you can present me with a technical analysis showing that slight jitter matters to low bandwidth usage drivers, and it makes a good case, you can convince me.

Remember the outcry against mooch on very theoretical basis? This is NOT theoretical. This WILL easily fail.

If you cannot provide good technical reasons for a variable length slot allocation table, then there is no reason to introduce the modulus complication that I've proven CAN cause problems.

P1 is famous for jitter on VGA output - the famous dot crawl.

A TON of people objected to purely theoretical, extremely unlikely "mooch" issues, that would have required bad programming to show up.

Unlike that, modulus has clearly proven (I proved it above) issues for VGA and other high bandwidth drivers.

Again, you say that the jitter is a problem.

Prove that minor jitter in when slow drivers can read/write the hub (which the drivers cannot notice, and cannot matter) is a bigger problem than VGA & high bandwidth drivers NOT WORKING.

I don't think you can prove that, as stopping the VGA etc drivers cold trumps minor theoretical niceties.

jmg · 2014-04-07 16:47

I fail to follow your logic.

My solution is a superset, there is no risk to you - you need never change the register, if you do not want to.
For those that do, caveat developer, your own words exactly.

Bill Henning · 2014-04-07 18:34

jmg,

Please re-read my analyses, and you will follow my logic. I believe you are skimming and not understanding the analysis.

If anyone changes the modulus, they cannot reliably use high bandwidth obex objects.

ALL obex objects must be written to the same slot assignment table length, or you CANNOT pick, choose and mix objects, due to missed hub windows for cogs locked to the hub.

Here is another proof - last one I will write.

User picks:

32/128 1080P driver #1
32/128 1080P driver #2
32/128 SDRAM driver
16/128 Hubexec business logic
1/128 1Mbps serial
1/128 1Mbps serial
20/100 sprite engine.

Changes modulus to 100.

The display drivers fail to work properly, not enough bandwidth

Sprite engines fail, not enough bandwidth

sdram/hubexec/serial not affected too much.

Ooops. Not enough slots left, even if the 1080p did not depend on exactly spaced 1/128 hub cycles to work.

Ok, let's drop driver #1.

32+32+16+2+20 = 102

Heck, lets drop the serial ports.

32+32+16+20 fits modulus 100!

OOPS. Display drivers are locked to hub, no longer get perfectly spaced (1:8) hub cycles. VISIBLE display corruption at a minimum, "SYNC OUT OF RANGE" almost guaranteed.
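
The slot arithmetic in this example can be checked mechanically. The Python sketch below is illustrative only: the dictionary keys are shorthand for the objects listed above, and "evenly spaced" is modelled simply as the table length being an exact multiple of the slot count, which is an assumption about what a hub-locked driver needs.

```python
def total_slots(requests):
    """Sum of the slots requested by all chosen objects."""
    return sum(requests.values())

def evenly_spaced(slots, table_len):
    """True when `slots` entries can sit at one constant gap in a table of `table_len`."""
    return slots > 0 and table_len % slots == 0

picks = {"1080P driver #1": 32, "1080P driver #2": 32, "SDRAM driver": 32,
         "hubexec logic": 16, "serial A": 1, "serial B": 1, "sprite engine": 20}

without_driver1 = {k: v for k, v in picks.items() if k != "1080P driver #1"}
without_serial = {k: v for k, v in without_driver1.items() if not k.startswith("serial")}

print(total_slots(picks), total_slots(without_driver1), total_slots(without_serial))
print(evenly_spaced(32, 128), evenly_spaced(32, 100))
```

The full selection asks for 134 slots, the reduced ones for 102 and 100, and a 32-slot request divides evenly into a 128-entry table (gap of 4) but not into a 100-entry one.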

Your modulus prevents obex mix and match.

If you do not want to admit that, provide a similar analysis, proving your point.

If you cannot prove your point, forget about modulus.

jmg wrote: »

I fail to follow your logic.

My solution is a superset, there is no risk to you - you need never change the register, if you do not want to.
For those that do, caveat developer, your own words exactly.

jmg · 2014-04-07 19:22

Here are a couple of very simple examples, On my-design Prop (which prove the danger of avoiding an idea 'just because'.)

Suppose I have a 120MHz system and I want to output a low power, jitter free audio signal at 48000Hz

On my-design Prop, I set up the modulus to 125 and now every 20 cycles, there is a hub slot nicely phase locked.
I allocate one power-slot, to give the lowest power usage, and now that process needs 0.8% of the Power Envelope.
It can take 20 cycles in the COG at this 0.8% power setting, no wasted cycles.

Suppose I also have a new DDS waveform generator, that takes 5 cycles and needs one HUB access,
I set that up for 100% Power slot access and 5 enabled-cycles per HUB, on a 5n = 125 modulus
Simple, and works like a charm. Code, and then Map.

By allocating both power and Hub in the same table, I can see at a glance how my system will perform in the Power Envelope

See how I have full system control, and I am not limited by anyone else's design oversight ?

Bill Henning · 2014-04-07 20:43

jmg,

I was able to decipher your first example!

jmg wrote: »

Here are a couple of very simple examples, On my-design Prop (which prove the danger of avoiding an idea 'just because'.)

Suppose I have a 120MHz system and I want to output a low power, jitter free audio signal at 48000Hz

On my-design Prop, I set up the modulus to 125 and now every 20 cycles, there is a hub slot nicely phase locked.
I allocate one power-slot, to give the lowest power usage, and now that process needs 0.8% of the Power Envelope.
It can take 20 cycles in the COG at this 0.8% power setting, no wasted cycles.

What I think you meant to say:

48kHz audio, say one 16 bit sample read from the hub, 48,000 times per second. Since you did not state the number of channels, I assume 1.

You want a 120MHz clock, presumably so you get 2,500 clock cycles between samples.

I figured this out when I tried dividing a 100MHz clock by 48,000 and got 2083.33 cycles between samples. Figuring the .33 cycles was the jitter you were

Ok, 2500 makes it clearer where you got the 125 and 20 from.
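
The division behind these numbers, as a small Python check (the helper name is made up for illustration):

```python
def cycles_per_sample(clock_hz, sample_hz):
    """Return (whole clock cycles, leftover clocks per second) between samples."""
    return divmod(clock_hz, sample_hz)

for clock in (120_000_000, 100_000_000, 60_000_000):
    whole, leftover = cycles_per_sample(clock, 48_000)
    print(f"{clock // 1_000_000}MHz: {whole} cycles, remainder {leftover}")
```

A 120MHz clock gives exactly 2500 cycles per sample (125 x 20), 60MHz gives exactly 1250, and 100MHz leaves a remainder, which is where the fractional 2083.33 comes from.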

The "Propeller Way" to do this:

Run at 60MHz, you don't need the extra speed

Now you have 1250 cycles between samples.

Here is the code:

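            mov     time, cnt           ' capture the current system counter
            add     time, cycles        ' first deadline is one sample period away

loop        rdword  val, samples        ' fetch the next 16 bit sample from the hub
            add     samples, #2         ' advance the hub pointer by one word
            waitcnt time, cycles        ' wait for the deadline, then add cycles for the next one
            ' insert dac output code here, I2S, i2c, serial whatever, set CTR's for R/C dac, you have a TON of time
            jmp     #loop

cycles      long    1250                ' use 2500 at 120MHz, etc
samples     long    0                   ' assumed: hub address of the sample buffer, set before launch
time        res     1                   ' next wake-up deadline
val         res     1                   ' current sample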

Absolutely no jitter. Does not break obex. Does not need obex-killer modulus or unnecessary power bits.
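
Why the waitcnt loop does not drift can be shown with a small Python model. It is a sketch under stated assumptions: waitcnt is modelled as "block until the counter reaches the deadline, then add the period to the deadline", the per-pass work costs are made-up numbers, and every pass is assumed to finish before its deadline.

```python
def waitcnt_wake_times(period, body_costs, start=0):
    """Model the loop: each pass does `cost` cycles of work, then waits for the deadline."""
    cnt = start
    deadline = start + period            # mov time,cnt / add time,cycles
    wakes = []
    for cost in body_costs:
        cnt += cost                      # variable work, including the hub read
        assert cnt <= deadline, "loop body overran the sample period"
        cnt = deadline                   # waitcnt blocks until cnt reaches the deadline
        wakes.append(cnt)
        deadline += period               # waitcnt then adds `cycles` to the deadline
    return wakes

wakes = waitcnt_wake_times(1250, [30, 45, 22, 38])
gaps = [later - earlier for earlier, later in zip(wakes, wakes[1:])]
print(wakes, gaps)
```

Because each deadline is derived from the previous deadline rather than from the current counter value, variation in the loop body (including hub access timing) does not show up in the spacing of the wake-ups.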

Need to save more power? Drop the clock rate more.

jmg wrote: »

Suppose I also have a new DDS waveform generator, that takes 5 cycles and needs one HUB access,
I set that up for 100% Power slot access and 5 enabled-cycles per HUB, on a 5n = 125 modulus
Simple, and works like a charm. Code, and then Map.

By allocating both power and Hub in the same table, I can see at a glance how my system will perform in the Power Envelope

See how I have full system control, and I am not limited by anyone else's design oversight ?

No, I don't see that, nor do I understand what you are trying to accomplish.

If you describe it clearly enough, I suspect it only needs a tiny assembly snippet to work - without jitter.

And I would like to understand what you are trying to do.
