We're looking at 5 Watts in a BGA!

Ken Gracey · 2014-04-05 20:04

RossH wrote: »

Sorry Ken, I didn't mean to put words in your mouth. I was referring to this post.

Ross.

Was being cynical. Guess it could've been read two ways.

Ken Gracey

David Betz · 2014-04-05 20:04

cgracey wrote: »

Argh! Just when I make up my mind!

We could yet go either way. I feel like the package issue is resolved, though.

Added: As Prop1-types go, I'd rather do 16 than 32 cogs, too.

I still think you need to ask yourself the question of whether the P1E can expand the market for the Propeller. It is very clear that it will appeal to those who already like the Propeller but does it have enough advantages over the current P1 that it will attract people who overlooked the Propeller before for whatever reason?

rogloh · 2014-04-05 20:13

David Betz wrote: »

I still think you need to ask yourself the question of whether the P1E can expand the market for the Propeller. It is very clear that it will appeal to those who already like the Propeller but does it have enough advantages over the current P1 that it will attract people who overlooked the Propeller before for whatever reason?

Very good point. Having both chips would be ideal to cover a larger overall market, but I doubt there are resources for completing both devices in a timely manner. One will obviously have to come first, but which market to address....legacy/future?

David Betz · 2014-04-05 20:21

rogloh wrote: »

Very good point. Having both chips would be ideal to cover a larger overall market, but I doubt there are resources for completing both devices in a timely manner. One will obviously have to come first, but which market to address....legacy/future?

What I'm worried will happen is that the chip will be made that will be most appealing to those in this forum but are we really representative of the target audience? I'd love to have a chip customized to my own needs and wants but that may not be the chip that will expand the Propeller market enough to make Parallax successful and provide the funds to build even more interesting chips in the future. Think about who you want to sell these chips to rather than who types the most suggestions in this forum.

mindrobots · 2014-04-05 20:21

koehler wrote: »

Mindrobots,

sadly this seems like an unfair jab at Ross.
Here you take his words, add timeline and durations to them, and then knock them down like a typical strawman attack.
So far, some "Sarcasm" aside, this forum is one of my favorite on the net due to its generally congenial if heated discussions.

I 'got' exactly what Ross was saying.

If I may give my reading of his post,

Parallax currently has the P1 in production, a known good working physical device that has been proven.
The P2, is currently a device that has already failed a shuttle run, and since that time, has morphed far beyond what that shuttle run device entailed.
So not only is there the potential for still-existing errors to be in that part of the original P2, but there are just as likely to be similar problems within all of the new stuff that's been tacked on, and probably far more possibility of 'subtle' errata.

Chip has already opined earlier that pulling out the old P1 and revisiting it was basically a no-brainer, because it was such a simple design, comparatively speaking. His words, or close to it.

So you do the math.
How hard would it be for Chip to take the P1, and double the Cogs from 8 to 16, knowing that it's a proven design?
I think it's realistically a lot easier for Parallax to do that successfully in a timely fashion than to 'prove' the P2.

If you are in favor of the P2, perhaps because you have spent $$ on a FPGA, then maybe just say so.
Trying to dismiss Ross' comments as wispy, ungrounded in reality is rather unseemly to anyone, and perhaps more so to someone who has given real substance back to the community at large.

It was not meant as an unfair attack at Ross. I greatly respect Ross for the selfless and thankless contributions he makes to the community, Catalina is a brilliant piece of work and were I a C programmer, I would be using it. The terms of "now", "soon", "easy" and other ambiguous terms have been thrown around by many, he was the last poster I read using those terms. I should have prefaced my definitions with "This is what I think" and closed it with "What do you mean by those terms?" If that caused Ross any offense, I am sorry for not wrapping my post properly. I am sorry for getting frustrated by all the technical bickering.

There is a lot of talk about "we" and "need to do" and "it must have" and the terms I used above through all these threads. These are mostly ambiguous and opinionated references, and they need to be qualified, as I failed to properly ask for in my post that you found so offensive to Ross.

I am not Chip, I do not know how hard it is for him to do any of this. I do know he is one of the smarter people I have ever met. I do know that he is extremely dedicated to his passion and his profession and works long and hard for us. I do know he has done amazing things with the P2 design, adding and changing things as we've found problems in testing or asked for things to be done differently as the P2 design has developed. I also know that I would never speak for him as to what is hard or not hard, I can't make that judgement. Only he can speak for what is hard and only he can really put any timeframes on the technical deliverables.

I also respect Ken and wouldn't speak for him in terms of what Parallax can or can't do in any business sense. I've talked to Ken before about general business matters and have opinions on business directions just like everyone has opinions. I believe I am allowed to have those and express those.

I've always and often said that I trust Chip to have the vision to make the best choices in the development of the P2 and in this case a P1.5 if needed. You know, "In Chip we Trust" (I actually think I've written that before)

I trust Ken to manage Parallax toward continued growth and success (I may even have told him that at one point).

I also think it is fair to ask what the general terms actually mean. As Ken pointed out, the 2 month time frame that Ross was assuming was not a timeframe that had been discussed.

Based on Parallax's past history for product releases, I think my EOY 2014 date for a P1.5 is reasonable, if not optimistic - that is my opinion, not meant to offend anyone or belittle anyone else's opinion.

Ross, please let me know if you are as offended by my comment as Koehler feels you should be. I will apologize to you personally.

As for my joking about the FPGA, it is just that, joking. I have willingly participated in the P2 testing since December of 2012 when I purchased a Nano. It has been fun, educational and an interesting diversion. I just happened to have purchased my DE2 at an amusing time, shortly thereafter, folks are running around the forum thinking the sky is falling.

I have no vested interests in the P2 over a P1.5 - being an FPGA owner or not. I have often stated that I trust Chip and Ken and will be happy with whatever comes out as the follow on Propeller - they need to do what is best for Parallax and I believe they will know what that is when the tough decisions need to be made.

As for being unseemly, that is a new one for me. I don't think I've ever posted unseemly before or intentionally taken unfair jabs at forum members in over 4000 posts. I've been irrelevant, vapid, vacuous, pointless and even occasionally helpful to other members but never unseemly that I can recall (If anyone does remember, please point it out to me).

I've contributed as best I can to my abilities. My efforts and abilities pale in comparison to Ross and others but I do try and contribute back to the community on a regular basis on a number of projects.

I don't know how to respond to "wispy" because I just don't understand that word in this context.

With all the commenting and debating and challenging of opinions going on in the past few days, I'm not sure why you singled me out. My intent was not hurtful (never has been) and though it was perhaps awkwardly worded, I still believe it is a valid point: Chip, Ken and Parallax as a company are the only ones that can speak to these general terms and all of this is just speculation and wishful thinking until word comes from Parallax.

But don't worry, anyone, I have thick network skin and have not taken offense at any of this directed at me. I will continue to contribute to the community as best I can and just word things more carefully and more tenderly so nobody is offended if I happen to make a mistake or ask the wrong question.

Good night!

cgracey · 2014-04-05 20:22

rogloh wrote: »

One interesting and simple thing you might imagine is the proposed P1 variant running a keyboard driver or a simple UART in a COG. Assuming no hub exec or LMM is used, you manage to squeeze the code into the 512 longs (which we already know can be done in this case) and you then burn a potential 100MIPs to do a keyboard driver. A whole 100MIPS of the device is lost just to decode PS/2 protocol! No problem we have plenty of COGs. LOL!

Now if 512 longs gets too small and you need more memory for your I/O driver or other code you could always try to use LMM and get yourself a 25MIP VM. You now consume 100MIPs for running your 25MIP VM. I'm assuming 1:8 hub cycles per COG and a 200MHz hub with an optimistic 4 cycle VM loop (not even sure that is possible, depends on final jump delay). This is only 25% efficient use of the COG's inherent power when fully loaded and running at 25MIPs, but you are consuming 75% of the total COG power for running the actual VM loop. These are the realities we face with a P1 variant @ 100MHz. It is going to be difficult to utilize all that power effectively and efficiently unless almost everything uses lots of high speed I/O, fits in single COGs and the COGs are working flat out driving the pins and not requiring large amounts of hub memory bandwidth.
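The arithmetic in the paragraph above can be checked with a few lines of Python. This is only a sketch under the numbers rogloh states (100 MIPS cog, 200MHz hub, one hub slot per cog every 8 hub clocks, one LMM instruction per hub fetch); the function name `lmm_efficiency` and its parameters are made up for illustration.

```python
def lmm_efficiency(cog_mips=100.0, hub_mhz=200.0, hub_share=8):
    """Return (vm_mips, efficiency) for an LMM loop limited by hub fetches.

    Assumes one instruction is executed per hub fetch and that the cog
    gets one hub slot every `hub_share` hub clocks (values from the post).
    """
    # Hub slots available to one cog, in millions per second
    hub_slots_mhz = hub_mhz / hub_share
    # The VM cannot run faster than it can fetch, nor faster than the cog
    vm_mips = min(cog_mips, hub_slots_mhz)
    return vm_mips, vm_mips / cog_mips


vm_mips, efficiency = lmm_efficiency()
print(f"{vm_mips:.0f} MIPS VM, {efficiency:.0%} of the cog's raw rate")
```

With the stated numbers this gives a 25 MIPS VM at 25% efficiency, so the other 75% of the cog's raw rate goes to the fetch loop itself.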

So if you don't rewrite and share a bunch of drivers in single COGs (something Cluso didn't want to have to do for P2), what do you have
100 MIPs keyboard/mouse driver COG,
100 MIPs full duplex uart COG,
25 MIPs LMM main application actually consuming 100MIPs of raw power but only getting 25% efficiency at best
etc

So what you can see from this is that the fundamental way to realize the true potential benefit and raw performance of the P1 variant would be to use hub exec and/or tasking. But I imagine this is a non-trivial change to the P1 variant and will take quite a while to do, so some people will not want to go down that path for expediency (and I would agree there too). So you skip it. End result is you have an updated chip with a whole lot of potential power on paper but quite difficult to realize in practice. Maybe that will satisfy the existing P1 market, however, when it comes to I/O pin limits and total hub RAM. It does solve that problem at least.

In comparison, in my opinion the P2 seems better balanced and designed for optimizing system performance by providing the ability to share the resources, but I would have to say I still prefer more than 4 COGs if at all possible, assuming they could fit the die size/power envelope. 4 COGs will definitely require more I/O driver sharing, but given what we've seen above it may make sense to do this on the P2. At least a USB host COG on a P2 would allow lots of I/O peripherals to be attached and help reduce the number of COGs consumed. Would FS USB host be possible on P1 variant without extra H/W support? Not sure. Maybe with multiple COGs in parallel. But again lots of potential MIPs used to do it.

If you wanted to save power in a 100 MIPS Prop1 cog, just do a WAITCNT - it will hold power near 0 until CNT is matched. That way, you could get the equivalent power of a 10 MIPS cog, if that was all you needed. Prop1 cogs are so dang simple, you can't believe it. In 2014 they are the equivalent of a 7400.
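As a rough illustration of the WAITCNT point, here is a duty-cycle sketch in Python. It assumes, as the post says, that power while waiting is near zero, and it treats active power as proportional to the fraction of time the cog is executing; `waitcnt_duty` and `relative_power` are hypothetical helper names, not anything from Parallax.

```python
def waitcnt_duty(needed_mips, peak_mips=100.0):
    """Fraction of time the cog must actually execute to deliver needed_mips."""
    if not 0 <= needed_mips <= peak_mips:
        raise ValueError("needed_mips must be between 0 and peak_mips")
    return needed_mips / peak_mips


def relative_power(duty, idle_fraction=0.0):
    """Average power relative to a cog running flat out.

    idle_fraction is the power drawn while parked in WAITCNT, as a
    fraction of active power (assumed 0.0 here, i.e. 'near 0').
    """
    return duty + (1.0 - duty) * idle_fraction


duty = waitcnt_duty(10.0)    # a driver that only needs 10 MIPS
power = relative_power(duty)
print(f"duty {duty:.0%}, about {power:.0%} of full-speed cog power")
```

Under those assumptions a 100 MIPS cog doing 10 MIPS of work sits in WAITCNT 90% of the time and draws roughly a tenth of its full-speed power.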

Ken Gracey · 2014-04-05 20:23

Phil Pilgrim (PhiPi) wrote: »

Chip, I'm not sure why that should even matter. It's not like it's a poll among your potential volume customers --
-Phil

Those customers have asked for:

- more RAM
- faster speed
- code protect
- A/D

. . .and those who didn't use it due to language choices would like efficient use of C.

Ken Gracey

rogloh · 2014-04-05 20:27

cgracey wrote: »

If you wanted to save power in a 100 MIPS Prop1 cog, just do a WAITCNT - it will hold power near 0 until CNT is matched. That way, you could get the equivalent power of a 10 MIPS cog, if that was all you needed. Prop1 cogs are so dang simple, you can't believe it. In 2014 they are the equivalent of a 7400.

Yeah you can do this for sure. Problem is you sort of paid for the other potential 90MIPs in the COG when you bought the device. If lots of other I/O driver COGs do this I feel you've kind of lost out a bit on the value of the chip.

David Betz · 2014-04-05 20:28

Ken Gracey wrote: »

Those customers have asked for:

- more RAM
- faster speed
- code protect
- A/D

. . .and those who didn't use it due to language choices would like efficient use of C.

Ken Gracey

Sounds like that matches well with a P1E that includes hub execution! :-)

mindrobots · 2014-04-05 20:28

koehler wrote: »

Ken,

We know you're lurking about, how about throwing us a bone and help us help you?

You're probably sitting back, hoping Chip can work his usual magic and make this problem go away for the most part.
However, it looks like we're at a full stop, with probably 4 Cogs.

The problem some of us see, is that going ahead with 4 Cogs means multithreading is now required, whereas in the past we had something akin to lightweight, disposable Cogs.

The premise of the Prop has been simplicity, and interruptless, single-threading functionality.

The P2 option now available appears to jettison that simplicity, and require multi-threading. Without useful interrupts at that.

The P1b/P16 may be an option, although less powerful, it seems to be closer to the Prop's raison d'

rogloh · 2014-04-05 20:31

Ken Gracey wrote: »

Those customers have asked for:

- more RAM
- faster speed
- code protect
- A/D

. . .and those who didn't use it due to language choices would like efficient use of C.

Ken Gracey

Only the first two are addressed by the P1 variant. P2 probably addresses most of them (not sure about A/D stuff).

jmg · 2014-04-05 20:34

Bill Henning wrote: »

Set the mapping so each task gets 128 longs. Presto, 4 baby cogs.

If the P2 morphs to 4 COGs to better fit the package Power profile, it will need some small 'balancing' tweaks.

1) The design is now light on total timers, so more of those per COG would be needed. eg 4 Nicely matches threads.

The P1 has Memory dominating COG logic, in die area, whilst the P2 has the opposite effect.
The Memory I believe has moved to OnSemi synth, from custom.

2) That gives some scope to adjust a 4 COG P2, to push up that average of 128 longs per task, which sounds light on a chip that will need more Task-packing.

Because one task is not going to need full operand access on another task's Code, that allows a simple mapping system that does something like
Allocates 50% of COG memory to shared data, as now - All tasks get RMW on this.
Each Task 'owns' its own code/private data space, which auto-swaps-in on Task select.

No extension to std 9 bit operand opcodes are needed, each Task thinks it has a Full COG

Now, you can have double the code per task, with the simple addition of 3 half blocks of COG ram.
Logic will still dominate a P2 COG.

Indirect opcodes would be able to reach all memory, so a 2 Task design gains Array storage.

cgracey · 2014-04-05 20:34

Ken Gracey wrote: »

Those customers have asked for:

- more RAM
- faster speed
- code protect
- A/D

. . .and those who didn't use it due to language choices would like efficient use of C.

Ken Gracey

This is a good summary of feedback we've received from customers.

Bill Henning · 2014-04-05 20:34

Hi Rick,

With a 4 cog P2 design, multi-threading is very useful, and would perform much better than cooperative pthreads for things like select() in C, and having different threads handle different tcp/ip sockets, and USB endpoints.

That does NOT mean that you can't do that without hardware support for threads - you can - but that it is more efficient with than without the hardware support.

As it turns out, we get support for threading for free with support for enhanced debugging; or if you prefer, we get support for enhanced debugging free with support for threads. It turned out that once you boiled it down, the infrastructure for both was the same, and required little extra logic.

mindrobots wrote: »

I don't think multi-threading is required, I believe hardware multi-tasking would solve the problem. In 4 instructions (SETTASK, JMPT1, JMPT2, JMPT3), you have easily split 1 cog into 4 cogs. As long as you pay attention to the effective clock rates and which things you can't multi-task, it seems to be a simple solution for multiple peripheral drivers in a single COG.

If the gurus have found otherwise, then as always, I defer to them. In the defense of the concept, if I was able to do it, then most anyone that can code PASM can do it. As for how it will be handled in other languages, we can't speak to that until there are other running language implementations.

mindrobots · 2014-04-05 20:35

rogloh wrote: »

Only the first two are addressed by the P1 variant. P2 probably addresses most of them (not sure about A/D stuff).

There was talk about using the P2 I/O pin in the P1 variant. The P2 I/O pins have significant built in features including DAC/ADC support. I think those would cover most ADC/DAC concerns Ken has heard.

(not speaking for anyone but myself based on my understanding of the technology)

cgracey · 2014-04-05 20:39

rogloh wrote: »

Yeah you can do this for sure. Problem is you sort of paid for the other potential 90MIPs in the COG when you bought the device. If lots of other I/O driver COGs do this I feel you've kind of lost out a bit on the value of the chip.

It's just silicon, though. That it can run way faster than what you need doesn't make it any more expensive. It runs fast just because it's so simple. The speed is free. The parallelism costs something. It could be said that you wouldn't need the parallelism if you could run a single thread fast enough, but everyone's here because they like the parallelism.

Bill Henning · 2014-04-05 20:41

Chip,

In the p2 verilog, can I assume that if you had four tasks, each receiving 1/4 time slices, they would get 1/4 of the hub slots for the cog?

So if the cog memory was mapped as 128 longs, as the 4 cog P2 will get a hub cycle every 4 clock cycles, the four equal priority tasks could count on a hub slot every 16 cycles?

If so, in that mode, cog tasks would be 100% deterministic (as long as they avoid CORDIC MUL DIV) and would for all intents and purposes be four 128 long deterministic baby cogs. That could even be loaded with totally independent driver objects.

Food for thought.

Then a 4 cog P2 could be used as:

1x 100MHz hubexec/threaded cog (4x-15x LMM performance due to hubexec, 800MB/sec hub bandwidth potential)
1x 100MHz video/dram cog (800MB/sec hub bandwidth potential, more than enough for 1080p60 32 bit)
8x 25MHz fully deterministic baby cogs with 128 longs each for drivers, 200MB/sec hub bandwidth each (actually deterministic tasks)

Mooch would potentially increase that 4x-15x performance (compared to LMM) by another factor of 2x-3x :-)
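The figures in the list above follow from simple slot arithmetic, sketched here in Python. The 100MHz clock, 4 cogs and 4 equal tasks come from the post; the 32 bytes per hub slot (one 8-long WIDE transfer) is an assumption chosen because it reproduces the 800MB/sec figure, and `hub_budget` is a made-up name.

```python
def hub_budget(clock_mhz=100.0, cogs=4, tasks=4, bytes_per_slot=32):
    """Per-cog and per-task hub timing for equal time slices.

    bytes_per_slot=32 assumes one WIDE (8 longs) moves per hub slot.
    """
    cog_slot_period = cogs                   # clocks between a cog's hub slots
    task_slot_period = cogs * tasks          # if tasks share those slots equally
    cog_bw = clock_mhz / cog_slot_period * bytes_per_slot   # MB/sec
    return {
        "cog_slot_period": cog_slot_period,
        "task_slot_period": task_slot_period,
        "task_mhz": clock_mhz / tasks,
        "cog_bw_mb_s": cog_bw,
        "task_bw_mb_s": cog_bw / tasks,
    }


budget = hub_budget()
baby_cogs = 2 * 4   # two of the four cogs split into four tasks each
print(budget, baby_cogs)
```

So each cog sees a hub slot every 4 clocks (800MB/sec), each of four equal tasks sees one every 16 clocks (200MB/sec at an effective 25MHz), and splitting two cogs this way yields the 8 baby cogs in the list.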

cgracey wrote: »

This is a good summary of feedback we've received from customers.

mindrobots · 2014-04-05 20:41

Bill Henning wrote: »

Hi Rick,

With a 4 cog P2 design, multi-threading is very useful, and would perform much better than cooperative pthreads for things like select() in C, and having different threads handle different tcp/ip sockets, and USB endpoints.

That does NOT mean that you can't do that without hardware support for threads - you can - but that it is more efficient with than without the hardware support.

As it turns out, we get support for threading for free with support for enhanced debugging; or if you prefer, we get support for enhanced debugging free with support for threads. It turned out that once you boiled it down, the infrastructure for both was the same, and required little extra logic.

I'm with you, Bill. The multi-threading will be a big help for the use cases you point out but it is a bit more complicated to implement in your code than 4 hardware tasks in a COG.

My point to the original poster who had stated concerns about the complexity of multi-THREADING was that a COG could easily be split into four by trying the very simple to use hardware multi-TASKING that the P2 also has. Four instructions and on your way with four tasks is a pretty cool feature!
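To make the "effective clock rate" point concrete, here is a small Python model of a time-sliced cog. It assumes the task schedule is a repeating table of sixteen 2-bit task numbers packed into one 32-bit word, which is my reading of how SETTASK works on the P2 design under test and not something stated in this thread; `pack_schedule` and `trace_schedule` are hypothetical helpers.

```python
from collections import Counter


def pack_schedule(slots):
    """Pack 16 task numbers (0-3) into a 32-bit word, slot 0 in the low bits."""
    if len(slots) != 16 or any(not 0 <= t <= 3 for t in slots):
        raise ValueError("need exactly 16 task numbers in the range 0-3")
    word = 0
    for i, task in enumerate(slots):
        word |= task << (2 * i)
    return word


def trace_schedule(word, clocks):
    """Return which task owns each clock when the 16 slots repeat forever."""
    return [(word >> (2 * (c % 16))) & 3 for c in range(clocks)]


word = pack_schedule([0, 1, 2, 3] * 4)   # four tasks, equal shares
trace = trace_schedule(word, 32)
share = Counter(trace)
print(hex(word), dict(share))
```

With equal shares each task owns every fourth clock, so on a 100MHz cog each behaves like a 25MHz cog. An uneven table (say, task 0 in half the slots) would trade speed between tasks the same way.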

cgracey · 2014-04-05 20:42

David Betz wrote: »

Sounds like that matches well with a P1E that includes hub execution! :-)

It does, doesn't it!

I looked into that the other day and it would probably double a Prop1 cog's complexity. Could be done, though, and we'd still be maybe 1/6 a Prop2 cog's size.

cgracey · 2014-04-05 20:44

rogloh wrote: »

Only the first two are addressed by the P1 variant. P2 probably addresses most of them (not sure about A/D stuff).

With the new I/O pins, a Prop1 cog could do A/D using the existing CTRs.

rogloh · 2014-04-05 20:49

Bill Henning wrote: »

Chip,

In the p2 verilog, can I assume that if you had four tasks, each receiving 1/4 time slices, they would get 1/4 of the hub slots for the cog?

So if the cog memory was mapped as 128 longs, as the 4 cog P2 will get a hub cycle every 4 clock cycles, the four equal priority tasks could count on a hub slot every 16 cycles?

If so, in that mode, cog tasks would be 100% deterministic (as long as they avoid CORDIC MUL DIV) and would for all intents and purposes be four 128 long deterministic baby cogs. That could even be loaded with totally independent driver objects.

Food for thought.

Then a 4 cog P2 could be used as:

1x 100MHz hubexec/threaded cog (4x-15x LMM performance due to hubexec, 800MB/sec hub bandwidth potential)
1x 100MHz video/dram cog (800MB/sec hub bandwidth potential, more than enough for 1080p60 32 bit)
8x 25MHz fully deterministic baby cogs with 128 longs each for drivers, 200MB/sec hub bandwidth each (actually deterministic tasks)

Mooch would potentially increase that 4x-15x performance (compared to LMM) by another factor of 2x-3x :-)

Interesting idea. I also wonder whether, as with the P1 variant, the power budget would allow the P2 hub to be run at 200MHz while the COGs go at 100MHz, and just what that might open up....

cgracey · 2014-04-05 20:49

Bill Henning wrote: »

Chip,

In the p2 verilog, can I assume that if you had four tasks, each receiving 1/4 time slices, they would get 1/4 of the hub slots for the cog?

So if the cog memory was mapped as 128 longs, as the 4 cog P2 will get a hub cycle every 4 clock cycles, the four equal priority tasks could count on a hub slot every 16 cycles?

If so, in that mode, cog tasks would be 100% deterministic (as long as they avoid CORDIC MUL DIV) and would for all intents and purposes be four 128 long deterministic baby cogs. That could even be loaded with totally independent driver objects.

Food for thought.

Then a 4 cog P2 could be used as:

1x 100MHz hubexec/threaded cog (4x-15x LMM performance due to hubexec, 800MB/sec hub bandwidth potential)
1x 100MHz video/dram cog (800MB/sec hub bandwidth potential, more than enough for 1080p60 32 bit)
8x 25MHz fully deterministic baby cogs with 128 longs each for drivers, 200MB/sec hub bandwidth each (actually deterministic tasks)

Mooch would potentially increase that 4x-15x performance (compared to LMM) by another factor of 2x-3x :-)

Any instruction that does a hub r/w will stall the pipeline until it's done. It's first-come/first-serve, however that crumbles.

jmg · 2014-04-05 20:50

cgracey wrote: »

It does, doesn't it!

I looked into that the other day and it would probably double a Prop1 cog's complexity. Could be done, though, and we'd still be maybe 1/6 a Prop2 cog's size.

Which could open up a 4 COG P2, with 6 (8?) x P1HE/P1E COGS in the space of the 5th P2 COG. ?

(and the remaining 3 P2 COGs, morphed into RAM )

Bill Henning · 2014-04-05 20:50

There is one fly in the ointment.

Please see my benchmark thread.

P1E with simple hubexec (no I & D caches, no quad or wide, no pointer instructions etc) will only provide approximately 25% speed increase over LMM.

Now 25% is not to be sneezed at, but a P2 cog hubexec is roughly 4x-15x faster than P16E32 LMM that uses FCACHE, and hubexec does not need FCACHE support in the compiler.

cgracey wrote: »

It does, doesn't it!

I looked into that the other day and it would probably double a Prop1 cog's complexity. Could be done, though, and we'd still be maybe 1/6 a Prop2 cog's size.

David Betz · 2014-04-05 20:51

cgracey wrote: »

Any instruction that does a hub r/w will stall the pipeline until it's done. It's first-come/first-serve, however that crumbles.

Why won't self-looping work here? I guess you'd have to arbitrate between multiple tasks waiting for the hub at the same time.

Rayman · 2014-04-05 20:55

Strange they didn't ask for more pins... That's what I really want most, I think.

Ken Gracey wrote: »

Those customers have asked for:

- more RAM
- faster speed
- code protect
- A/D

. . .and those who didn't use it due to language choices would like efficient use of C.

Ken Gracey

Bill Henning · 2014-04-05 20:55

Thanks, got it.

If I am correct, that means that if all four tasks are trying for hub access (in every 16 cycle window) it would pretty much degenerate to each of the four tasks getting one slot.

Next question:

If you go for a 4 cog P2, would it be difficult to add a round-robin mode to the tasks' access to the hub? (ie cog gets 1/4 hub cycles, tasks get at those slots in a round-robin manner)

The reason that I ask, is if that can be made deterministic, the tasks would be fully deterministic (except CORDIC / MUL / DIV) - just as deterministic as P1 cogs.

And that would address the biggest concern that the P16E32 proponents posit.
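A toy Python model of the difference being asked about. It is not Chip's pipeline: it simply assumes the cog gets a hub slot every 4 clocks, that in round-robin mode task k owns every fourth one of those slots, and that in first-come mode pending requests are served one per cog slot in arrival order. All names here are invented for the sketch.

```python
COG_SLOT_PERIOD = 4   # clocks between hub slots for one cog (4 cog P2)
TASKS = 4


def wait_round_robin(task, clock):
    """Clocks until `task` reaches its own slot; other tasks do not matter."""
    period = COG_SLOT_PERIOD * TASKS
    slot_phase = task * COG_SLOT_PERIOD
    return (slot_phase - clock) % period


def wait_first_come(clock, queued_ahead):
    """Clocks until served when `queued_ahead` requests are already waiting."""
    to_next_slot = (-clock) % COG_SLOT_PERIOD
    return to_next_slot + queued_ahead * COG_SLOT_PERIOD


# Same request time, different behaviour:
rr = wait_round_robin(task=1, clock=5)               # fixed by the clock alone
fc_idle = wait_first_come(clock=5, queued_ahead=0)   # nobody else waiting
fc_busy = wait_first_come(clock=5, queued_ahead=3)   # three tasks ahead
print(rr, fc_idle, fc_busy)
```

In this model the worst case is 15 clocks either way, but the round-robin wait depends only on when the task asks, while the first-come wait also depends on what the other tasks are doing. That dependence is what costs determinism.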

cgracey wrote: »

Any instruction that does a hub r/w will stall the pipeline until it's done. It's first-come/first-serve, however that crumbles.

Electrodude · 2014-04-05 20:58

David Betz wrote: »

Why won't self-looping work here? I guess you'd have to arbitrate between multiple tasks waiting for the hub at the same time.

Then what would happen if two tasks wanted the hub at the same time? A queue or some other mess? Round robin would probably be bad because of how tasks are implemented. Blocking is the simplest way.

cgracey · 2014-04-05 20:58

The Prop2 cogs have some state-machine features that make them able to do some things that no number of Prop1 cogs are going to accomplish:

RGB-based video with color-space conversion
Pin transfer to/from WIDEs - fast external memory I/O
Goertzel algorithm in CTRs with dithered DAC output
CORDIC computer for circular functions
Single-clock MAC with auto-increment-and-wrapping pointers

These cogs are sleek and make the Prop1 cogs look like farm tractors.

mindrobots · 2014-04-05 20:59

Rayman wrote: »

Strange they didn't ask for more pins... That's what I really want most, I think.

I thought it was funny at some point today on one of the threads, the question of resources being exhausted came up: pins, memory or cogs. Within a few posts, three different people came up with the three resources in three different orders.

Just goes to show you....ask a typical user...and then go ask a few more!
