Consensus on the P16X32B?

dnalor · 2014-04-06 11:18

jazzed wrote: »

Maybe the "current" P2 is not the P2 they wanted in the first place?

Not maybe !!!

It was interesting to follow the P2 threads, but I always thought: what the hell are these guys talking about? This cannot be the next Propeller; there has to be something in between.

Brian Fairchild · 2014-04-06 11:32

GordonMcComb wrote: »

The world can function with both a P1+ and P2.

I think that, as P1+ will only be in a TQFP package, your sentence should read...

"The world can function with a P1, a P1+ and a P2."

Brian Fairchild · 2014-04-06 11:39

Bill Henning wrote: »

- 64 entry hub slot allocation table (128 would give even more control)

The more I think about this proposal the more I like it.

potatohead · 2014-04-06 12:30

Would it make sense to put some of the P2 video system in? Perhaps we can get the nice signal quality, double buffer maybe? If it were kept simple, without the color correction, etc... P1 COGS could drive it to some nice resolutions and color depths, perhaps split the middle?

A nice dev station would still be possible then.

And this:

It's funny, as much as I do love the beefed up P16X32B idea, the more I think about it, the more it makes me think why not put 8 P2 cogs in the 180nm P2?

is my thought from the beginning. At 60Mhz, it's a killer chip that would see a lot of uses.

Finally, I'm for what funds our futures no matter what. I just don't know what that is.

Baggers · 2014-04-06 12:45

It wouldn't work, potatohead. I think the only way forward, for now at least, is the P16X32B.

potatohead · 2014-04-06 12:55

Well we do have DACS on all the pins. With that many COGS, at 180Mhz, 160, whatever, we could just write one that performs well. The good signal quality comes from those circuits. Ideally, we can get PAL working P1 style.

Three COGS does a nice component display, 640 for sure. Maybe 1024

I'm game. That could work with P1 waitvid driving 8 bits, VGA mode into the DAC. Just need three of them. I have some code for that I started on P1 to do component way back when...

Maybe it works anyway.

We just don't get really fast, big programs. But we do get faster big programs.

Phil Pilgrim (PhiPi) · 2014-04-06 13:08

potatohead wrote:

Would it make sense to put some of the P2 video system in?

I think that's inevitable. Chip, please correct me if I'm wrong, but the P1 video might not work with a synthesized design, since all the outputs have to be synchronized to the system clock instead of to an asynchronous PLL. This would make the video timing tricky, except with some convenient crystal frequency. If the P1 digital video does disappear, I will miss the other oddball uses it's been proven to facilitate (e.g. multi-channel PWM).

-Phil

ctwardell · 2014-04-06 13:27

It starts making one wonder, is it better to try to move some P2 features into P1 style COGs or trim back P2 COGs until they are more power friendly...

I wonder just how bad consumption would be if:

- Lose the WIDEs and Quads. Toggling that many bits on every hub address change takes considerable power; I think the latest figures from OnSemi have the memories using about 1/3 of the total power.
- Do a simple HUBEXEC like Ray has suggested for a P1 type solution.
- Examine the impact of tasks on power consumption; if it is significant, lose them.

This would be pretty gutted compared to where we were, but would still give a lot of the P2 features without trying to shoehorn them onto a P1 style COG.

That's it for now, the opium is wearing off...

C.W.

potatohead · 2014-04-06 13:28

Will be interesting to see what Chip does with it.

jmg · 2014-04-06 14:42

ctwardell wrote: »

It starts making one wonder, is it better to try to move some P2 features into P1 style COGs or trim back P2 COGs until they are more power friendly...

I think Chip is already moving in this direction, by doing both.

He is looking at removing the big Mathops from P2 COG into a common area, which both shrinks the P2 COG, and also raises the IQ of any smaller COGs.
It means less total Big Math OPS, but (very?) few designs need 8 copies of these. It does save die area.
It is relatively simple to move carefully selected other P2 opcodes over into the smaller COGs.

altosack · 2014-04-06 16:59

I really like the idea of 16 or so P1E cogs that can be given unequal hub access with a 64-cell array to specify which cogs get what, as suggested by Bill Henning. This also makes an uncached hubexec feasible, particularly if a cog in hubexec mode can be gated on hub accesses, reducing its power usage.

What would be the minimum execution spacing; could a single cog be given every other slot, or even, in an extreme case, adjacent slots ? And how does this affect power usage by having the hub do this at 200 MHz ?
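To make the spacing question concrete, here is a minimal sketch of how a 64-entry slot table could be modelled and checked. Everything in it is an assumption for illustration only: the function names, the round-robin fill for the remaining cogs, and the idea that the table is simply walked in order. It says nothing about what the hardware could actually sustain or about power usage.

```python
# Hypothetical model of a 64-entry hub slot table (illustrative names only).
TABLE_SIZE = 64


def build_slot_table(favored_cog, stride, num_cogs=16, size=TABLE_SIZE):
    """Give `favored_cog` every `stride`-th slot; share the rest round-robin."""
    others = [c for c in range(num_cogs) if c != favored_cog]
    table = []
    nxt = 0  # index of the next non-favored cog to receive a slot
    for slot in range(size):
        if slot % stride == 0:
            table.append(favored_cog)
        else:
            table.append(others[nxt % len(others)])
            nxt += 1
    return table


def slot_gaps(table, cog):
    """Distances in slots between consecutive accesses of `cog`, with wraparound."""
    positions = [i for i, owner in enumerate(table) if owner == cog]
    if not positions:
        return []
    n = len(table)
    return [
        (positions[(k + 1) % len(positions)] - p) % n or n
        for k, p in enumerate(positions)
    ]


if __name__ == "__main__":
    table = build_slot_table(favored_cog=0, stride=2)
    print("cog 0 spacing:", set(slot_gaps(table, 0)))
    print("worst spacing for cog 1:", max(slot_gaps(table, 1)))
```

With `stride=2` the favored cog sees every other slot, and `stride=1` models the extreme "adjacent slots" case, which in this model starves every other cog.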

RossH · 2014-04-06 17:20

altosack wrote: »

I really like the idea of 16 or so P1E cogs that can be given unequal hub access with a 64-cell array to specify which cogs get what, as suggested by Bill Henning. This also makes an uncached hubexec feasible, particularly if a cog in hubexec mode can be gated on hub accesses, reducing its power usage.

What would be the minimum execution spacing; could a single cog be given every other slot, or even, in an extreme case, adjacent slots ? And how does this affect power usage by having the hub do this at 200 MHz ?

Hi altosack,

I can't really put you down as a Yes on this basis, since this is outside the "envelope" Chip identified. Should I put you down as a No, or would you accept a P16X32B without this functionality (but maybe with a preference)?

Ross.

Bill Henning · 2014-04-06 17:20

Hi,

- I am quite confident that a P1E could use every fourth hub slot. There is a possibility of *maybe* every third for LMM.

- hubexec with a few helper instructions might, just barely and with low probability, be able to use every second hub slot.

For minimum work on the P1E core, the minimum instructions and changes for a 512KB hubexec would be:

1) Hub slot mapping that is the absolute minimum logic needed to make hubexec perform significantly better than LMM.
(and I strongly dislike LR registers, but they need the least hardware support here, so I will even suggest them)
2) JMP D/#hubaddr ' hubaddr is the upper 17 bits of a 19 bit hub address - ie we address the hub as 128K longs, also used for RET
3) CALL D/#hubaddr ' hubaddr is the upper 17 bits of a 19 bit hub address - ie we address the hub as 128K longs
4) LOAD dest ' load register with the value of the long that follows the load instruction, used to load constants/addresses

That's it.

It should be conditional, and it *MUST* store the return address in a known fixed location to avoid the need for a way to specify what register to store the address in

Next step to improve it is to add '@' relative addressing to JMP, CALL, as that allows relocatable code. Relocatable data requires significantly more hardware support.
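The address arithmetic behind items 2 and 3 can be sketched as follows. This is only an illustration of the bit counts (512KB is a 19-bit byte address, and dropping the two low bits leaves a 17-bit long address covering 128K longs); the helper names are made up here and are not part of any proposed instruction set.

```python
# Illustrative helpers; names are hypothetical, only the bit arithmetic matters.
HUB_BYTES = 512 * 1024   # 512KB hub, i.e. a 19-bit byte address
LONG_ADDR_BITS = 17      # 128K longs -> 17-bit long address


def to_long_addr(byte_addr):
    """Convert a long-aligned 19-bit byte address to its 17-bit long address."""
    if not 0 <= byte_addr < HUB_BYTES:
        raise ValueError("byte address outside the 512KB hub")
    if byte_addr & 0b11:
        raise ValueError("JMP/CALL targets must be long-aligned")
    return byte_addr >> 2


def to_byte_addr(long_addr):
    """Convert a 17-bit long address back to a 19-bit byte address."""
    if not 0 <= long_addr < (1 << LONG_ADDR_BITS):
        raise ValueError("long address does not fit in 17 bits")
    return long_addr << 2


if __name__ == "__main__":
    last_long = to_long_addr(HUB_BYTES - 4)
    print(hex(last_long), hex(to_byte_addr(last_long)))
```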

Edit:

- For best results, there has to be a minimum of twice as many hub slots in the slot array as cogs, ideally four times as many slots as cogs.
- with four times as many slots, having 32 cogs is practical and a good idea, does not "waste" hub bandwidth on serial comms etc
- "mooching" would allow NOT having to align 1/4 - 1/3 of all hub slots to each "hubexec" cog

altosack wrote: »

I really like the idea of 16 or so P1E cogs that can be given unequal hub access with a 64-cell array to specify which cogs get what, as suggested by Bill Henning. This also makes an uncached hubexec feasible, particularly if a cog in hubexec mode can be gated on hub accesses, reducing its power usage.

What would be the minimum execution spacing; could a single cog be given every other slot, or even, in an extreme case, adjacent slots ? And how does this affect power usage by having the hub do this at 200 MHz ?

jmg · 2014-04-06 17:28

RossH wrote: »

I can't really put you down as a Yes on this basis, since this is outside the "envelope" Chip identified.

Of course, the "envelope" Chip identified has already moved since post #1, and will likely continue to move.

RossH · 2014-04-06 17:47

jmg wrote: »

the "envelope" Chip identified has already moved since post #1, and will likely continue to move.

Not that I'm aware of. If Chip comes up with a more concrete proposal I may update the first post, but we don't want the P16X32B to suffer the "death of a thousand cuts" that has affected the P2.

Ross.

altosack · 2014-04-06 18:23

RossH wrote: »

Hi altosack,

I can't really put you down as a Yes on this basis, since this is outside the "envelope" Chip identified. Should I put you down as a No, or would you accept a P16X32B without this functionality (but maybe with a preference)?

Ross.

Hi Ross,

Sorry, this question probably didn't belong in this thread.

In any case, I support going ahead with the P16X32 as you've defined it in post #1.

My dream prop has ADCs and either hardware PWM or enough cogs for 8-phase synchronous PWM (16 would be plenty !).

David

RossH · 2014-04-06 18:31

altosack wrote: »

In any case, I support going ahead with the P16X32 as you've defined it in post #1.

Done: Tally now 43 in favor, 3 against.

Ross.

jmg · 2014-04-06 18:37

altosack wrote: »

My dream prop has ADCs and either hardware PWM or enough cogs for 8-phase synchronous PWM (16 would be plenty !).

Note that real PWM support is outside the original "envelope" Chip identified, but has certainly been suggested by others.
P2 Timers have real PWM, so the support is possible.

Numbers on how many PWMs you need would help define how many P2-type timers are needed. Is that 16?

Chip has still to publish full P2 Timer details, would your needs require Sync-across-cogs of the timer-PWM ?

altosack · 2014-04-06 19:30

jmg wrote: »

Note that real PWM support is outside the original "envelope" Chip identified, but has certainly been suggested by others.
P2 Timers have real PWM, so the support is possible.

Numbers on how many PWMs you need, would help define how many P2-type timers are needed. is that 16 ?

Chip has still to publish full P2 Timer details, would your needs require Sync-across-cogs of the timer-PWM ?

To be clear, this is a pie-in-the-sky project that probably will never see volume production. My prototype uses 4 P1 cogs for 16 synchronized outputs, but is restricted to a duty cycle >55% (or <45%, but not both !); using 8 cogs would remove that restriction. So here is a case of 4 P2 cogs being insufficient (using 8 threads will lower the resolution x4 at the same clock speed), unless the PWM modules can be synchronized across cogs.

Of course, I could use a dsPIC GS or 2 as peripheral chips, but that is so... inelegant.

Dave Hein · 2014-04-06 19:37

Ross, I looked at your first post in this thread and I see some people committing to fund the P1+. However, without committing a precise amount this is somewhat meaningless. Someone stating that they would be willing to fund the project may only be thinking of something in the range of $100 or less. That amount would be somewhat insignificant, and you would probably need to commit to contributing $1000 or more to make it meaningful. So is that the order of funding you are talking about?

jmg · 2014-04-06 19:44

altosack wrote: »

To be clear, this is a pie-in-the-sky project that probably will never see volume production. My prototype uses 4 P1 cogs for 16 synchronized outputs, but is restricted to a duty cycle >55% (or <45%, but not both !); using 8 cogs would remove that restriction. So here is a case of 4 P2 cogs being insufficient (using 8 threads will lower the resolution x4 at the same clock speed), unless the PWM modules can be synchronized across cogs.

Of course, I could use a dsPIC GS or 2 as peripheral chips, but that is so... inelegant.

The P2 timer has true PWM, so should solve duty cycle issues like above, but I'm not sure on the across-cog support - we are all waiting on DOCs from Chip.

There are common apps that need 6 PWM modulated drives for 3 phase Bridges, would be interesting to see how a P2 timer can address that. Some uC PWM peripherals can include dead-bands on those 6 PWM outputs.
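For readers unfamiliar with dead-bands, here is a minimal software sketch of one complementary high/low pair with a dead-band. It is a generic illustration only: the P2 timer details are unpublished, so the function name, the edge-aligned counter, and the tick-based parameters are all assumptions and do not describe any Parallax hardware. A 3-phase bridge would use three such pairs, giving the 6 outputs.

```python
# Generic complementary-PWM sketch; not a model of the (unpublished) P2 timer.


def complementary_pwm(period, on_counts, dead):
    """Return one PWM period as a list of (high_side, low_side) booleans.

    The high side is on for ticks [dead, on_counts) and the low side for
    ticks [on_counts + dead, period), so both are off for `dead` ticks
    around every switching edge, including the wrap from one period to the next.
    """
    if dead < 0 or not dead <= on_counts <= period - dead:
        raise ValueError("need 0 <= dead <= on_counts <= period - dead")
    frames = []
    for t in range(period):
        high = dead <= t < on_counts
        low = on_counts + dead <= t < period
        frames.append((high, low))
    return frames


if __name__ == "__main__":
    frames = complementary_pwm(period=100, on_counts=40, dead=5)
    overlap = any(h and l for h, l in frames)
    print("shoot-through ticks present:", overlap)
```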

msrobots · 2014-04-06 19:55

Why not?

I spend at least that amount every year at Parallax, and I am just a hobbyist. Have not sold any project in my life. Did some, but to no avail.

And I have not had so much fun programming since the Wang2000 and TRS80 times.

MVS and Cobol are boring compared to the Prop.

This is money well spent to have fun. I recently got new tires for the red SL500. Spent $1260. And have fun driving it.

Now we need just 60 people thinking like me and a shuttle run is paid for. Other people spend $500 on an FPGA. Would not have spent that money without Chip's pre-releases.

Why - for FUN!. And that is what I think counts a lot.

Without Parallax my life would be way more boring. Forget crowd sourcing. Forum sourcing!

Throw in some Tortillas at the Walnut farm or some 3 day FPGA P2 training at Parallax and I chime in with $2000 without problem.

OzPropDev will pay more for a flight down to Rocklin from Down under and would be happy to do that as he stated.

Why? I guess he has a lot of FUN with his FPGA...

Enjoy!

Mike

David Betz · 2014-04-06 19:56

Bill Henning wrote: »

Hi,

- I am quite confident that a P1E could use every fourth hub slot. There is a possibility of *maybe* every third for LMM.

- hubexec with a few helper instructions might, just barely and with low probability, be able to use every second hub slot.

For minimum work on the P1E core, the minimum instructions and changes for a 512KB hubexec would be:

1) Hub slot mapping that is the absolute minimum logic needed to make hubexec perform significantly better than LMM.
(and I strongly dislike LR registers, but they need the least hardware support here, so I will even suggest them)
2) JMP D/#hubaddr ' hubaddr is the upper 17 bits of a 19 bit hub address - ie we address the hub as 128K longs, also used for RET
3) CALL D/#hubaddr ' hubaddr is the upper 17 bits of a 19 bit hub address - ie we address the hub as 128K longs
4) LOAD dest ' load register with the value of the long that follows the load instruction, used to load constants/addresses

That's it.

It should be conditional, and it *MUST* store the return address in a known fixed location to avoid the need for a way to specify what register to store the address in

Next step to improve it is to add '@' relative addressing to JMP, CALL, as that allows relocatable code. Relocatable data requires significantly more hardware support.

Edit:

- For best results, there has to be a minimum of twice as many hub slots in the slot array as cogs, ideally four times as many slots as cogs.
- with four times as many slots, having 32 cogs is practical and a good idea, does not "waste" hub bandwidth on serial comms etc
- "mooching" would allow NOT having to align 1/4 - 1/3 of all hub slots to each "hubexec" cog

We also need at least AUGS to allow a 32 bit constant to be loaded. That can't be handled the way it is in LMM. In fact, that would let you use normal 9 bit S field instructions for CALL and JMP but it would require two longs for every CALL or JMP. That's no worse than LMM though which uses a two-word macro.
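As a rough illustration of the two-long scheme, the sketch below splits a 32-bit constant into the part a prefix instruction would carry and the part that fits in a normal 9-bit S field. It assumes an AUGS-style prefix that supplies the upper 23 bits; the function names are hypothetical and the actual encoding on an enhanced P1 is not defined anywhere in this thread.

```python
# Hypothetical split of a 32-bit constant across an AUGS-style prefix (upper
# 23 bits, an assumption) and the 9-bit immediate S field of the next instruction.
S_BITS = 9
MASK32 = 0xFFFFFFFF


def split_constant(value):
    """Return (upper23, lower9) for a 32-bit constant."""
    value &= MASK32
    return value >> S_BITS, value & ((1 << S_BITS) - 1)


def join_constant(upper23, lower9):
    """Rebuild the 32-bit constant the instruction pair would load."""
    return ((upper23 << S_BITS) | lower9) & MASK32


if __name__ == "__main__":
    upper, lower = split_constant(0x12345678)
    print(hex(upper), hex(lower), hex(join_constant(upper, lower)))
```

The same split would apply to a hub CALL or JMP target, which is why each one costs two longs under this approach.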

RossH · 2014-04-06 19:59

Dave Hein wrote: »

Ross, I looked at your first post in this thread and I see some people committing to fund the P1+. However, without committing a precise amount this is somewhat meaningless. Someone stating that they would be willing to fund the project may only be thinking of something in the range of $100 or less. That amount would be somewhat insignificant, and you would probably need to commit to contributing $1000 or more to make it meaningful. So is that the order of funding you are talking about?

I deliberately didn't specify, since the funding model itself is not specified. For instance, I might not be willing to "donate" $1,000 but I might be willing to advance purchase $1,000 worth of product etc.

It is intended only as an indication of commitment. With nearly half of all the Yes respondents indicating they would be willing to make some kind of up-front financial contribution, then if it is the case that the main thing stopping the P16X32B is simple financing, then Parallax might consider some kind of crowd-sourcing model. It is not intended to pressure either Parallax or the forum members into agreeing to anything, and I would certainly not expect anyone to be held to any such commitment at this stage.

Ross.

Dave Hein · 2014-04-06 19:59

When you guys are done "designing" the P1+ you're going to end up with something slightly less than a P2. So why not start with the P2, and remove just enough features to get it down to 2 Watts? I think you'd get there a lot quicker than adding features to P1.

Electrodude · 2014-04-06 20:06

David Betz wrote: »

We also need at least AUGS to allow a 32 bit constant to be loaded.

All the more reason not to add hubexec. This is supposed to be a beefed-up P1, not a slightly watered-down P2 that uses just as much power. Besides, there are virtually no opcodes left on the P1 to do anything.

Actually, there is another free opcode besides mul, muls, enc, ones, and hubop. How come djnz and tjnz are different opcodes on the P1? Why isn't tjnz just djnz nr? That would free up an opcode but hurt backwards compatibility. Also, can tjz wr be made to act as djz on the P1E?

Dave Hein · 2014-04-06 20:08

RossH wrote: »

I deliberately didn't specify, since the funding model itself is not specified. For instance, I might not be willing to "donate" $1,000 but I might be willing to advance purchase $1,000 worth of product etc.

OK, so "funding" really means you will get an equivalent amount of product back in return. Sounds kind of like KickStarter. I would still be interested in how much each person is willing to kick in.

potatohead · 2014-04-06 20:09

Some of the ideas, like centralizing and pipelining the math suggest such a path is possible.

RossH · 2014-04-06 20:11

Dave Hein wrote: »

I would still be interested in how much each person is willing to kick in.

If we ever get to that point, so would I. Lots of talk on these forums - let's see how people vote with their wallets!

Ross.

David Betz · 2014-04-06 20:13

Electrodude wrote: »

All the more reason not to add hubexec. This is supposed to be a beefed-up P1, not a slightly watered-down P2 that uses just as much power. Besides, there are virtually no opcodes left on the P1 to do anything.

Actually, there is another free opcode besides mul, muls, enc, ones, and hubop. How come djnz and tjnz are different opcodes on the P1? Why isn't tjnz just djnz nr? That would free up an opcode but hurt backwards compatibility. Also, can tjz wr be made to act as djz on the P1E?

Didn't someone suggest using the P2 instruction encodings for the enhanced P1? That would allow AUGS to be brought over easily. The 17 bit CALL and JMP are a bigger problem though, since the P2 only has 256k of hub memory, which only requires 16 bits. As I mentioned earlier, you could just use a normal CALL and JMP preceded by an AUGS instruction, but that makes hub CALL and JMP take two longs. Again, this is no worse than LMM on the original Propeller.
