P2 - New Instruction Ideas, Discussions and Requests

Bill Henning · 2014-03-28 10:50

Guys,

I am becoming very concerned that having more than one cog could allow users to be confused and make debugging code more difficult.

I mean how can they be guaranteed that they will always have one more cog to start?

How could they possibly handle running out of cogs?

I think it would be much simpler to have only one cog.

While we are at it, we should also add interrupts.

After all, that is MUCH simpler and easier to debug!

[sarcasm mode: off]

mindrobots · 2014-03-28 10:52

MOOCH - Make Out of Order Calls to Hub, works for me!!

Bill Henning · 2014-03-28 10:54

My minimum happiness level is something I proposed many weeks ago.

ONE cog can enter hungry/green/recycle/mooch mode. No need for priorities, no need for arbitration, and it still gets us ONE C/VM cog that can be faster.

potatohead wrote: »

Frankly, if we don't do it at all, I'm fine with that too. If we get any more complex than normal / mooch modes, I'm opposed anyway.

mindrobots · 2014-03-28 10:59

Bill Henning wrote: »

My minimum happiness level is something I proposed many weeks ago.

ONE cog can enter hungry/green/recycle/mooch mode. No need for priorities, no need for arbitration, and it still gets us ONE C/VM cog that can be faster.

That introduces a feature that has presumed/potential benefit with a minimum amount of silicon into the Propeller universe. Much like what was done with multi-threading and Task3. If it catches on and lives up to its potential, it can be leveraged more in the P3 design.

ctwardell · 2014-03-28 11:21

mindrobots wrote: »

That introduces a feature that has presumed/potential benefit with a minimum amount of silicon into the Propeller universe. Much like what was done with multi-threading and Task3. If it catches on and lives up to its potential, it can be leveraged more in the P3 design.

I have to disagree with that assessment.

Limiting multithreading to a single task was about the amount of silicon that would be consumed in giving all tasks multithreading ability as opposed to a single task.

Not implementing slot sharing or implementing a cobbled version of it seems to be driven primarily by the concern that it will be abused or misunderstood.

Maybe those are valid concerns.

Just keep in mind that if we leave it out, we are limiting what a user CAN do, based on what a user MIGHT do.

C.W.

Bill Henning · 2014-03-28 11:38

Agreed, and I don't think those are valid concerns.

Reductio ad absurdum:

- P2 might be used to control weapons.
- Weapons can cause significant harm.
- Therefore the P2 must not be produced.

ctwardell wrote: »

I have to disagree with that assessment.

Limiting multithreading to a single task was about the amount of silicon that would be consumed in giving all tasks multithreading ability as opposed to a single task.

Not implementing slot sharing or implementing a cobbled version of it seems to be driven primarily by the concern that it will be abused or misunderstood.

Maybe those are valid concerns.

Just keep in mind that if we leave it out, we are limiting what a user CAN do, based on what a user MIGHT do.

C.W.

mindrobots · 2014-03-28 11:53

The only reason I was suggesting a limited implementation in P2 was this:

cgracey wrote: »

The area they'll need is 21.03 square mm. We have 22.9 square mm available. This is good news! There's room for SERDES in there.

Does anyone besides Chip have an idea how much of this space SERDES will take? He says there's room, but how much extra room? Do we know how much a fully implemented MOOCH will take? Is there space still for last minute bugs, must haves and "oh sh@t" additions?

Bill Henning · 2014-03-28 12:04

Rick,

Re: mooch/green/etc

I was not aiming at you... but at people who want to kill mooch.

At a guess, we have quite a bit of room, as most of the 21.03mm2 would be the hub and the cogs.

FYI, the simple (one cog gets mooch) would take almost no logic as currently it probably goes something like:

if need_hub AND turn == myid then got_hub cycle

would change to

if need_hub AND ((turn == myid) OR (hub_avail))

where hub_avail would be equivalent to

(turn != myid) && (cog[turn].need_hub == FALSE) ' much simpler to express as a few gates

The complications and extra logic come in when there is an attempt to prioritize who gets spare slots, or to distribute them fairly among several cogs.

The logic required for this (at a guess) is less than 50 gates. Inconsequential, less than 0.001 mm2 at a guess.
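The single-moocher rule above can be sketched as a small simulation. This is only an illustration, not the actual hub logic: it assumes eight cogs in round-robin order, fixed demand per cog, and the helper names `hub_grant` and `count_grants` are hypothetical.

```python
NUM_COGS = 8  # assumption: eight cogs sharing the hub in round-robin order


def hub_grant(turn, need_hub, moocher=None):
    """Return the id of the cog that gets the hub this slot, or None.

    turn     -- id of the cog that owns this slot
    need_hub -- list of booleans, one per cog
    moocher  -- id of the single cog in mooch mode, or None
    """
    if need_hub[turn]:
        return turn  # the owner always wins its own slot
    # hub_avail: it is not the moocher's turn and the slot owner is idle
    if moocher is not None and moocher != turn and need_hub[moocher]:
        return moocher
    return None


def count_grants(need_hub, moocher=None, rotations=1):
    """Count hub grants per cog over whole rotations with fixed demand."""
    grants = [0] * NUM_COGS
    for slot in range(rotations * NUM_COGS):
        winner = hub_grant(slot % NUM_COGS, need_hub, moocher)
        if winner is not None:
            grants[winner] += 1
    return grants
```

With cogs 0, 1 and 4 wanting the hub, `count_grants(need)` gives each of them one slot per rotation. Making cog 0 the moocher hands it the five idle slots as well, while cogs 1 and 4 keep exactly what they had, which is the point of the proposal: nobody loses a guaranteed slot.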

Re: SERDES

Depends on how much it tries to do, and how many modes.

Adding quad support to SERDES may take more gates than adding a separate QSPI module, as bit serial vs. 4 bit parallel is too different, and would require many muxes.

Only after determining the feature set could we even guess, and only Chip would know, how much die area it will take.

It may take less to extend XFER to support 4 and 8 bit bus width (my personal favorite).

mindrobots wrote: »

The only reason I was suggestion a limited implementation in P2 was this:

Does anyone besides Chip know have an idea how much of this space SERDES will take? He says there's room but how much extra room? Do we know how much a full implemented MOOCH will take? Is there space still for last minute bugs, must haves and "oh sh@t" additions?

cgracey · 2014-03-28 12:12

I like the idea of MOOCH. You shouldn't depend on a moocher, as Bill said. So, if you enable mooch mode, you may or may not get some extra performance. I think this is a good approach. Really simple, too.

jmg · 2014-03-28 12:57

Bill Henning wrote: »

Adding quad support to SERDES may take more gates than adding a separate QSPI module, as bit serial vs. 4 bit parallel is too different, and would require many muxes.

The separate module argument may apply to small pieces of USB (eg EdgeSync BaudDiv) , but QuadSPI ?

- I do not see that - QuadSPI is still serial - the difference is simply taps in a shift register.
- ie two (or three) lines of verilog, one for 1-bit, one for Quad, (and one for Dual )
Likewise the bit-counter has +1/+2/+4 modes, and the real work is in the Pin-allocation. Nothing saved there, in separate modules.

Importantly, QuadSPI devices usually START working in 1 bit SPI mode, so that 'separate module' needs 1 bit anyway, ( plus must duplicate all the buffers and register controls...)
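As a rough software model of the "taps in a shift register" argument, the sketch below shifts one byte out MSB first at 1, 2 or 4 bits per clock. It is not Verilog and not any real SERDES design; the function name `shift_out` and the mapping of the highest bit to the highest data line are assumptions made for illustration.

```python
def shift_out(byte, width):
    """Shift one byte out MSB first, `width` bits per clock (1, 2 or 4).

    Returns a list with one tuple per clock; each tuple holds the bits
    driven on the data lines, highest line first.
    """
    if width not in (1, 2, 4):
        raise ValueError("width must be 1, 2 or 4")
    reg = byte & 0xFF
    clocks = []
    count = 0
    while count < 8:
        # Tap the top `width` bits of the shift register.
        taps = tuple((reg >> (7 - i)) & 1 for i in range(width))
        clocks.append(taps)
        reg = (reg << width) & 0xFF
        count += width  # the bit counter steps by +1, +2 or +4
    return clocks
```

The same register and counter serve all three modes; only the number of taps and the shift distance change. For example, `shift_out(0xA5, 4)` takes two clocks and `shift_out(0xA5, 1)` takes eight.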

Bill Henning wrote: »

It may take less to extend XFER to support 4 and 8 bit bus width (my personal favorite).

Parallel mode ports, for LCD/Camera (etc), are certainly useful, but unless they also include a 1 bit SPI mode, they are not that practical for QuadSPI.

Sure, if the MUX cost is low, Parallel mode could include 4/8 bits (along with 12/16/18/24 for LCDs )

Bill Henning · 2014-03-28 13:04

I am in favor of adding QSPI to SERDES if it takes less logic and is simpler,

and

I am in favor of having QSPI have separate logic if it takes less logic and is simpler

basically,

whichever takes less logic and is easier

as I definitely would like to have QSPI.

Heck, I'd also love to have an 8 bit XFER mode, as it would help with camera modules, old parallel VGA, and more.

jmg wrote: »

The separate module argument may apply to small pieces of USB (eg EdgeSync BaudDiv) , but QuadSPI ?

- I do not see that - QuadSPI is still serial - the difference is simply taps in a shift register.
- ie two (or three) lines of verilog, one for 1-bit, one for Quad, (and one for Dual )
Likewise the bit-counter has +1/+2/+4 modes, and the real work is in the Pin-allocation. Nothing saved there, in separate modules.

Importantly, QuadSPI devices usually START working in 1 bit SPI mode, so that 'separate module' needs 1 bit anyway, ( plus must duplicate all the buffers and register controls...)

Dave Hein · 2014-03-28 13:11

If it makes people feel better to call hub-sharing MOOCH, and that helps to get it implemented then I'm all for it. I don't care if it's called SCUMSUCKER or some other derogatory term as long as it gets implemented. A single-cog implementation would be nice. More cogs would be nicer.

potatohead · 2014-03-28 13:16

I actually want the passive cycle mooching mode, just to be clear. I didn't intend on denigrating it in any way, just seeking an accurate characterization that cuts down on fear perception.


Heater. · 2014-03-28 13:56

@Ramon,

By your definition there is no programmer that is an "adult person".

I have seen seasoned programmers, teams of them, get into a mess with interrupts and multi-tasking and race conditions and all sorts. This in the world of safety critical avionics software! This is even before we talk about a multi-processor machine like the Propeller.

By your definition I would presume that we don't need structured programming, some kind of "goto" can do all that "if", "while", "for", "repeat" etc constructs can do. We don't need blocks around functions, it's obvious where a function ends by its return statements. Heck, skip all that syntax checking and use assembler. No, let's use HEX, adult programmers know all their machine's instructions, right?

History suggests programmers need protecting from themselves, and in this case from each other.

@Bill,
Not the same at all:

"Bob" writes a new top level Spin object. He pulls in three objects, each of which start three cogs. OOPS.

True, but some simple error checking soon shows when they have overstepped the limit.

"Smith" pulls in five objects that in aggregate use 300KB of hub. OOPS.

True, but the compiler immediately points out the problem.

"Jones" waitpeq's on a pin that will never change, for the wrong state. OOPS.

True. Sounds like correct program operation to me.

How are the above scenarios different from slot recycling?

Because with these timing issues things fail in surprising and unexpected ways and there is nothing to help you.
Your program logic is correct. All the parts tested by themselves work fine. The finished program fails at random!

If they don't Read The Fine Manual, and blindly pull random code together without knowing what they are doing, they can hardly complain if it does not work as expected.

RTFM is not the answer.

Exactly. I do expect that if I blindly pull objects from OBEX into my program they should "just work" (assuming they fit in memory and have pins available). I should not have to study them to find out what's up after fighting with random failures for hours.

We should be targeting programmers & developers, not rank beginners, as beginners will start with a subset of features, and work themselves up.

I keep hearing this statement. It makes no sense. All software developers are "rank beginners" all the time. If I create a nice library or object for you to use what do you know about it? Nothing, noobie you. You might find out when you discover it does not work as advertised and have wasted hours integrating it.

...most beginners will start with a P1 anyway - at least in education.

I really hope that is not the idea. PII should be as approachable as PI.

Bottom line is, HUB slot sharing introduces all the problems of interrupts. We don't like the complexity of interrupts and we have multiple cores exactly to avoid the problems of interrupts, ergo, we don't like HUB slot sharing.

Heater. · 2014-03-28 14:00

Chip,

So, if you enable mooch mode, you may or may not get some extra performance.

Changing the name of the solution does not change the problem.

If my object needs "mooch mode" to work correctly. And your object needs "mooch mode" to work correctly, then put them together in a users program and he is screwed.

It's your call of course.

jmg · 2014-03-28 14:06

Bill Henning wrote: »

I am in favor of adding QSPI to SERDES if it takes less logic and is simpler,

and

I am in favor of having QSPI have separate logic if it takes less logic and is simpler

basically,

whichever takes less logic and is easier as I definitely would like to have QSPI.

Heck, I'd also love to have an 8 bit XFER mode, as it would help with camera modules, old parallel VGA, and more.

Hmm, I wonder if two QuadSPI's can be 'locked' to manage 8-wide streaming ?

Supporting that should have very minimal logic cost, and I can see users attaching two QuadSPI Flash parts to double the bandwidth.

jmg · 2014-03-28 14:15

Heater. wrote: »

If my object needs "mooch mode" to work correctly. And your object needs "mooch mode" to work correctly, then put them together in a users program and he is screwed.

This is a rare situation, and 'screwed' is not quite correct, as all modules have minimum bandwidths.

If someone simply tries to whack two together, without doing any bandwidth budgets, or reading, then they will learn.
It will still work, just not as fast as they hoped.

(just as they do now, if they whack two of anything together)

If they then DO the bandwidth calculation, and find with some simple co-operation each module CAN 'mooch' quite well, do you really want to exclude that solution, just because ?

If you want proven operation, then a means to check available bandwidth in HW, and some means to report who uses what in software, do not sound too complex.
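A bandwidth budget of the kind described here could start as a simple average calculation. The sketch below assumes eight cogs with one guaranteed hub slot each per rotation and a hypothetical helper `mooch_budget`; it only checks averages and says nothing about when the spare slots actually fall, so it cannot prove that a timing-critical driver will work.

```python
NUM_COGS = 8  # assumption: eight cogs, one guaranteed slot each per rotation


def mooch_budget(demand):
    """Check an average hub-slot budget.

    demand -- slots wanted per rotation, one entry per cog
    Returns (spare, wanted, fits): slots left unused by cogs under their
    guarantee, slots wanted beyond the guarantee, and whether wanted <= spare.
    """
    if len(demand) != NUM_COGS:
        raise ValueError("expected one demand figure per cog")
    spare = sum(max(0.0, 1.0 - d) for d in demand)
    wanted = sum(max(0.0, d - 1.0) for d in demand)
    return spare, wanted, wanted <= spare
```

Two mooching modules that want 3 and 2 slots per rotation fit if the other cogs leave 3 slots idle on average, as in `mooch_budget([3.0, 2.0, 1.0, 0.5, 0.0, 0.0, 0.5, 1.0])`. If the others leave only 2 idle, the budget fails and both modules fall back toward their guaranteed minimum.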

jmg · 2014-03-28 14:20

cgracey wrote: »

I like the idea of MOOCH. You shouldn't depend on a moocher, as Bill said. So, if you enable mooch mode, you may or may not get some extra performance. I think this is a good approach. Really simple, too.

I can see this being used for gains in energy usage, as the system still works fine at the lower bandwidth, but finishes faster with the peaks.

Some means to easily support testing at the lower performance would be good - ie you design and test at the lower bandwidth, and enable the power-save option.

Bill Henning · 2014-03-28 14:41

Heater,

Two simple solutions, I strongly prefer #1. Both solve your objection.

1) Obex rule: No mooch objects allowed.

Easy to enforce, the "SETMODE MOOCH" can be found in the pasm source, or even as a hex long, can be scanned for automatically.

2) Only allow one cog to "SETMODE MOOCH". This will waste a lot of potentially reusable unused slots.
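The automatic scan suggested under rule 1 might look like the sketch below. `SETMODE MOOCH` is the hypothetical mnemonic used in this thread, not a finalized instruction, and `find_mooch_lines` is an illustrative name. Only the source-text scan is shown; scanning for the hex long would need an encoding that has not been defined.

```python
import re

# Hypothetical mnemonic taken from the thread; no such instruction is final.
MOOCH_PATTERN = re.compile(r"\bsetmode\s+mooch\b", re.IGNORECASE)


def find_mooch_lines(pasm_source):
    """Return 1-based line numbers that contain a SETMODE MOOCH instruction."""
    hits = []
    for number, line in enumerate(pasm_source.splitlines(), start=1):
        code = line.split("'", 1)[0]  # drop Spin/PASM line comments
        if MOOCH_PATTERN.search(code):
            hits.append(number)
    return hits
```

An object would pass the rule when `find_mooch_lines` returns an empty list for every one of its source files. Mentions of the instruction inside comments are ignored.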

Heater. wrote: »

Because with these timing issues things fail in surprising and unexpected ways and there is nothing to help you.
Your program logic is correct. All the parts tested by themselves work fine. The finished program fails at random!

...

Bottom line is, HUB slot sharing introduces all the problems of interrupts. We don't like the complexity of interrupts and we have multiple cores exactly to avoid the problems of interrupts, ergo, we don't like HUB slot sharing.

Bill Henning · 2014-03-28 14:43

Perhaps a global hub flag - disable MOOCH?

I can see it being useful for testing. Could even be implemented as LOCK #31 ... only 0..7 are currently available (I wish all 32 or at least 16 were).

jmg wrote: »

I can see this being used for gains in energy usage, as the system still works fine at the lower bandwidth, but finishes faster with the peaks.

Some means to easily support testing at the lower performance would be good - ie you design and test at the lower bandwidth, and enable the power-save option.

Bill Henning · 2014-03-28 14:44

Whichever way takes less logic and less of Chip's time

The advantage to using XFER is the engine to do it in the background.

jmg wrote: »

Hmm, I wonder if two QuadSPI's can be 'locked' to manage 8-wide streaming ?

Supporting that should have very minimal logic cost, and I can see users attaching two QuadSPI Flash parts to double the bandwidth.

ctwardell · 2014-03-28 14:50

Bill Henning wrote: »

Heater,

Two simple solutions, I strongly prefer #1. Both solve your objection.

1) Obex rule: No mooch objects allowed.

Easy to enforce, the "SETMODE MOOCH" can be found in the pasm source, or even as a hex long, can be scanned for automatically.

2) Only allow one cog to "SETMODE MOOCH". This will waste a lot of potentially reusable unused slots.

Item 2 cannot stand alone because, "What if someone needs two objects that need MOOCH". So we would need 1 anyway so people can't find MOOCH objects in the Obex. If we need 1 anyway then I wouldn't suggest offering up 2 as a solution.

Then of course if we implement 1 we will encourage a "black market" Obex offering MOOCHing objects...

C.W.

Heater. · 2014-03-28 14:52

jmg,

This is a rare situation..

Is it? We have not lived in a world of HUB slot sharing COGs yet. How can we know that?

'screwed' is not quite correct, as all modules have minimum bandwidths.

As an old Brit I don't use such words lightly. The bandwidth fight is exactly what I'm talking about.

If someone simply tries to whack two together, without doing any bandwidth budgets, or reading, then they will learn. It will still work, just not as fast as they hoped.

That is exactly what I don't want to see.

Imagine designing a circuit and you put an AND gate in there. Then you realise you need another AND gate so you put that in as well. But, what? now the first AND gate fails at random. Or perhaps the second. Or both. Then you search google and the forums and end up asking why it does not work. Silly you, don't you know that if you put a second AND gate in the same project it affects the speed of the first one. And vice versa!

Of course with driver software running in a COG we are working at a much higher level of abstraction than an AND gate but the principle is the same.

(just as they do now, if they whack two of anything together)

Not true. On the Prop One there is no such timing dependency.

If you want proven operation, then a means to check available bandwidth in HW, and some means to report who uses what in software, do not sound too complex.

Actually I think it's an insoluble problem.

Are you up to writing a software analysis tool that can look at the runtime timing of a program and tell if it fails or not. Given that such static analysis has no idea what the runtime inputs and timing may be? I don't think so.

Cluso99 · 2014-03-28 14:54

Bill Henning wrote: »

Perhaps a global hub flag - disable MOOCH?

I can see it being useful for testing. Could even be implemented as LOCK #31 ... only 0..7 are currently available (I wish all 32 or at least 16 were).

The answer is simple (thanks Bill for the idea - just fleshed it out a bit more)...

Use a global equate to globally enable/disable slot sharing. Each cog still requires an instruction to set slotsharing on.

CON
  _clkmode   = xinput
  _xinfreq   = 80_000_000
  _slotshare = 1          ' 1=enabled, 0=disabled

evanh · 2014-03-28 14:54

jmg wrote: »

- I do not see that - QuadSPI is still serial - the difference is simply taps in a shift register.

From what I've read Quad-SPI is parallel. There would need to be a special shuffle circuit to reorder all the bits being loaded/unloaded into the shifter(s) if a shifter was to be used at all.

ctwardell · 2014-03-28 14:58

Heater. wrote: »

Actually I think it's an insoluble problem.

Are you up to writing a software analysis tool that can look at the runtime timing of a program and tell if it fails or not. Given that such static analysis has no idea what the runtime inputs and timing may be? I don't think so.

You are correct in this assessment, therefore we need to remove the hardware tasks, since they give you slot sharing within a cog.
Without the tasks we will lose multitasking, but hey, we are demanding absolute determinism, so let's be consistent and remove all those bad things.

C.W.

evanh · 2014-03-28 15:06

ctwardell wrote: »

Without the tasks we will lose hubexec, ...

That's not true. Hubexec has no need for threaded cores nor even more than one core.

ctwardell · 2014-03-28 15:09

evanh wrote: »

That's not true. Hubexec has no need for threaded cores nor even more than one core.

You're correct, I meant to say multitasking.

We really should remove hubexec as well though if the goal is absolute timing determinism.

C.W.

Heater. · 2014-03-28 15:17

ctwardell.

You are correct in this assessment, therefore we need to remove the hardware tasks, since they give you slot sharing within a cog.

This is a non sequitur. My arguments have been all about objects and drivers and such, in whatever language, that reside within a COG playing well with other code in a different COG.

Without the tasks we will lose hubexec,

This is not true. I don't see how those two things are related (someone please correct me if I am wrong here)

we are demanding absolute determinism, so lets be consistent and remove all those bad things.

It's not a case of removing all those "bad things". We don't have them yet. It's a case of not introducing new bad things.

Heater. · 2014-03-28 15:25

ctwardell,

You're correct, I meant to say multitasking.

Still sounds wrong. You might have to elaborate on that.

We really should remove hubexec as well though if the goal is absolute timing determinism.

How does hubexec destroy absolute timing determinism?

Anyway the determinism I am concerned with when talking about hub slot sharing is perhaps not the same as you have in mind. I'm talking always about the timing independence between what goes on in one COG vs what goes on in another. That is to say, the determinism that two COG processes do not affect each other's performance.
