Radical P2 design changes - discussion only

Phil Pilgrim (PhiPi) · 2014-03-02 23:21

potatohead wrote:

Seriously. People want to make some stuff, not continue to plan for the best thing ever next year, while not making stuff.

I swore I wasn't going to comment on -- or even pay attention to -- the P2 anymore. But the title of this thread was just provocative enough to draw me in. Anyway, when I find myself of one mind with another forumista, I cannot refrain.

Amen, potatohead! Amen!

-Phil

cgracey · 2014-03-02 23:42

Phil Pilgrim (PhiPi) wrote: »

I swore I wasn't going to comment on -- or even pay attention to -- the P2 anymore. But the title of this thread was just provocative enough to draw me in. Anyway, when I find myself of one mind with another forumista, I cannot refrain.

Amen, potatohead! Amen!

-Phil

Phil, this thread is an opium den that spontaneously formed at the on-going, world-wide, round-the-clock, burning-man of Prop2 Contemplations. Welcome.

Phil Pilgrim (PhiPi) · 2014-03-03 00:03

Chip,

LOL! I didn't even need a password to get in, and now I feel like I'm caught in a shrimp trap. Maybe if I pass out, no one will notice that I'm here.

-Phil

JRetSapDoog · 2014-03-03 00:25

cgracey wrote: »

We're going to be getting more room, because I'm taking all the 1.8V logic that is now in the I/O pad and redoing it in Verilog, so that it will be synthesized and become part of the core. This is going to simplify our full-custom layout and make the core area bigger. This also reduces risk, because all that spaghetti logic's timing will be closed by the synthesis tools. I'm thinking, too, that the power rings can be reduced to allow the main 1.8V grid to handle much more of the load. This will probably get us another 4 square mm of area, which we might actually need for the bigger cogs.

This is GREAT NEWS! I know it must be difficult to adjust for those that have laid the groundwork that will need to be changed, but such "inward spiraling" development toward the best compromise is how design often happens, even though it would be nice to just race to the center directly (with very clear foresight). And a lot gets learned along the way, even though much effort gets overturned. After starting a "What Remains to be Done?" thread, I had the thought that a better one would have been a "What Could Go Wrong?" thread, particularly in terms of synthesis and final silicon (hmm...maybe I wouldn't want to know as it would scare me, though please feel free to start one, anyone, if inclined). On (just) browsing VHDL pages, I read that what can be simulated sometimes can't be synthesized. Anyway, upon reading the first sentence of Chip's post, I thought, "Great, less risk!" Then, his post mentions such reduced risk further in, but now I see that the risk reduction is much more than I guessed. I believe that this is the kind of change that could get us a P2 around the end of this year instead of late 2015. And who knew or had any inkling that Chip was working on this change!

Edit: Chip's response made me laugh.

cgracey · 2014-03-03 00:27

JRetSapDoog wrote: »

And who knew or had any inkling that Chip was working on this change!

Not me! I just realized from the first posts in this thread that we really need to get aggressive about freeing silicon area. This has been in the back of my mind for a while, but not in the forefront.

potatohead · 2014-03-03 00:30

Phil, this thread is an opium den that spontaneously formed at the on-going, world-wide, round-the-clock, burning-man of Prop2 Contemplations.

IMHO, this hilarious nugget was worth the entire thread! Well played Chip!

cgracey · 2014-03-03 00:40

potatohead wrote: »

IMHO, this hilarious nugget was worth the entire thread! Well played Chip!

Here, we can view the forumistas, in order of their postings:

ozpropdev · 2014-03-03 00:43

I don't look very well do I?

cgracey · 2014-03-03 00:47

ozpropdev wrote: »

Remove multi-tasking.... I suddenly feel very ill.
It would have to be replaced with something so brilliant to deaden the pain!

You're at peace now.

potatohead · 2014-03-03 00:48

Me neither... I appear to be "that passed out guy" no matter which end I start from.

Phil Pilgrim (PhiPi) · 2014-03-03 00:49

'Fascinating photo! Despite the obvious depravity and obliviousness of their clients, the proprietors have seen fit to post tasteful artwork on the wall. (Or are those P2 docs and mask layouts?)

-Phil

potatohead · 2014-03-03 00:54

That's the documentation on the proposed task swapping system.

Indeed. Great photos. I'm sure the clients appreciated them for a while... maybe they even got better as they passed out!

cgracey · 2014-03-03 00:55

Phil Pilgrim (PhiPi) wrote: »

'Fascinating photo! Despite the obvious depravity and obliviousness of their clients, the proprietors have seen fit to post tasteful artwork on the wall. (Or are those P2 docs and mask layouts?)

-Phil

Those are everyone's latest proposals for resource sharing during multitasking.

ozpropdev · 2014-03-03 01:04

A picture says a thousand words posts!

Cluso99 · 2014-03-03 01:07

cgracey wrote: »

Those are everyone's latest proposals for resource sharing during multitasking.

And those of us who are just trying to understand the proposals - I am even too depraved to recognise which one is me

Cluso99 · 2014-03-03 01:10

Now, can we please move on to USB/SERDES and come back to multitasking or whatever once that is done and presuming there is still space available???

potatohead · 2014-03-03 01:10

Thanks for the laughs guys. Off to bed. Have a good week.

BTW: Tonight, I was coding buttons and some interactivity into my fractal program. Toyed with a three COG render too. Fun stuff. So far, I've done it mostly P1 style, employing the much greater throughput and execute speed during blanks, blank lines, etc... The whole thing will fit nicely with no tasks at all. However... it's all over the place. But fun.

A refactor to use the tasking proper, perhaps on the next build image we get, should clean it up nicely. The whole exercise highlights how much better the chip is just with the hardware tasks. IMHO, we got that one right, even if we do have to use the big hammer to TLOCK / TFREE in some cases.

Honestly, I think many of us are not actually employing the capabilities we've got right now. These exclusive times will be on the order of cycles, perhaps tens of cycles the vast majority of the time. Heck, at 200MHz, it really isn't much. Perspective...
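For a sense of scale, here is a minimal sketch of that arithmetic. It assumes the 200 MHz clock mentioned above and one clock per cycle; the function name is hypothetical.

```python
# Convert a count of clock cycles into nanoseconds.
# Assumption: 200 MHz system clock, as mentioned in the post.
CLOCK_HZ = 200_000_000


def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Return the duration of `cycles` clock cycles in nanoseconds."""
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    return cycles * 1_000_000_000 / clock_hz


if __name__ == "__main__":
    # A few lock durations "on the order of cycles, perhaps tens of cycles".
    for n in (1, 10, 50):
        print(f"{n} cycles = {cycles_to_ns(n)} ns")
```

Under that assumption, a lock held for tens of cycles lasts well under a microsecond.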

cgracey · 2014-03-03 01:15

Cluso99 wrote: »

Now, can we please move on to USB/SERDES and come back to multitasking or whatever once that is done and presuming there is still space available???

Tomorrow I will finish the task swapping circuitry, add static register remapping, and add SETINDA/B D (register). After those things, it's USB/CRC, then SERDES.

evanh · 2014-03-03 02:46

For my two cents worth I like what I see. I'd like to point out that the current developments were almost foregone steps the moment Chip announced the Prop2 was going to be 8 cogs with pipelining and multi-ported general registers (I guess, going right back to each cog having 9-bit register addressing was pushing in this direction right from the start). Aside from the increased transistor count, the extra horsepower was just begging to be sliced and diced to effect the features expected of a more powerful processor.

Not making allowances for shared execution within a Cog would have seriously nerf'd the Prop2 ... resulting in it being relegated to just an expensive version of the Prop1 with not enough advantages. That might seem a bit harsh but I feel it's not far off the mark.

Also, Hubexec was not something I'd expected in the Prop2, I'd fantasised about its equivalent in the "Prop3", but as many have implied/said already, it will make the Prop2 stand apart from the Prop1. It's like the Propeller architecture has matured to what it was meant to be with hubexec. I imagine many here are thinking the same ...

Heater. · 2014-03-03 04:20

I don't know about the opium den. I always imagine the Prop II following forumistas more like this: http://www.youtube.com/watch?v=YG8yl_oLvzw

Chip is the quite relaxed one on the opposite side of the fence

Dave Hein · 2014-03-03 05:18

In many, if not most P1 applications, one cog is used to run the main code, and the other 7 run peripheral drivers. Though P2 will probably be used in many different ways, suggestions for improvements for the dominant P1-use case have been deferred. I fear that those suggestions will not be implemented because of the attention to modes that will not be used that much.

EDIT: It also seems that the endless suggestions for changes are postponing an update to the FPGA and the documentation. Some features that were important months ago are in danger of being removed to make room for somebody else's new idea. It's interesting seeing the architecture design in real-time, but it is a bit shocking to see the ad-hoc way features are added and removed.

Heater. · 2014-03-03 05:51

What I'm worried about is how far away from the P1 the P2 has actually drifted.

If I take my P1 projects and want to run them on P2 how much work am I in for?

I can appreciate that not all P1 source code is going to be compatible with P2. But how much is "not all"? Is this a few tweaks here and there or the equivalent of a rewrite?

There is a lot of effort invested in the OBEX. I might have expected one of the first items on the software agenda was getting a lot of that ported over to P2.

I guess I could answer these questions for myself with enough scrutiny of the docs that have been released and playing with the FPGA image.

However my brain froze over a while back in face of the constant changes I'm reading about. I really don't have the time or will to dedicate to it. I have to admit I'm waiting for the big announcement "It's done, frozen, feature complete, go at it for testing".

Dave Hein · 2014-03-03 06:14

Porting P1 PASM code will be simple for some objects, and other objects will require a complete rewrite. The pfth kernel ported easily with just a few changes for code that jumped to return addresses, and the code that accessed the I/O ports. I've been looking into porting the P1 Spin interpreter, and that is not easy. I've made a few tries at it, and gave up after each attempt. I'm currently looking at another approach for doing the Spin interpreter that I should be posting soon.

The biggest concern I have at this point is all the hub and pipeline stalls that I'm seeing. Reducing those will require a significant amount of software work. There have been suggestions to reduce the hub and pipeline stalls, but they have either been ignored or are waiting for the task/thread stuff to be done.

cgracey · 2014-03-03 07:38

Dave Hein wrote: »

...The biggest concern I have at this point is all the hub and pipeline stalls that I'm seeing. Reducing those will require a significant amount of software work. There have been suggestions to reduce the hub and pipeline stalls, but they have either been ignored or are waiting for the task/thread stuff to be done.

Dave,

Do you remember what kinds of ideas have been floated for reducing the hub and pipeline stalls? Hub slot sharing is all that I remember for reducing hub stalls. I can't remember anything for pipeline stalls, though, except talk of holding off RDxxxx/WRxxxx instructions.

Dave Hein · 2014-03-03 08:07

Yes, hub slot sharing is the main feature that I'm referring to. I think this is an extremely important feature, and I'm puzzled why it has fallen below in priority to thread swapping (or is it task swapping?). I think hub slot sharing would be beneficial in almost all applications. The percentage of applications that will use thread swapping would look like a line on a pie chart.
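One way to picture what hub slot sharing could buy is a toy model. The sketch below assumes eight cogs with the hub window rotating one cog per clock, and assumes "sharing" means a waiting cog may take a window whose owner is idle. Neither detail is specified in this thread, and `wait_for_hub` and `busy_cogs` are hypothetical names.

```python
# Toy model of hub access latency with and without slot sharing.
# Assumptions (not from the thread): 8 cogs, the hub window advances by one
# cog per clock, and with sharing enabled a waiting cog may borrow the window
# of any cog that is not in `busy_cogs`.
NUM_COGS = 8


def wait_for_hub(cog, now, busy_cogs, sharing):
    """Return the clocks `cog` waits, starting at clock `now`, for a hub window."""
    for delay in range(NUM_COGS):
        owner = (now + delay) % NUM_COGS
        if owner == cog:
            return delay  # the cog's own window always works
        if sharing and owner not in busy_cogs:
            return delay  # borrow a window its owner is not using
    raise AssertionError("unreachable: a cog's own window recurs every NUM_COGS clocks")


if __name__ == "__main__":
    busy = {1, 2}  # cogs that always use their own window
    print("strict round-robin:", wait_for_hub(0, 1, busy, sharing=False))
    print("with slot sharing: ", wait_for_hub(0, 1, busy, sharing=True))
```

In this model a cog that has just missed its own window waits seven clocks under strict rotation, but only until the next idle window when sharing is allowed.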

The other feature is the fjmp, fcall and fret instructions that I proposed. I can understand not looking into this feature immediately, since it hasn't really been investigated thoroughly. I'm not sure how many people have actually analyzed the pipeline stalls that their code generates, but I think it's a significant amount. The way to avoid pipeline stalls is to use a delayed jmp/call with three instructions after them that will be executed. In some cases this can be accommodated, but I think there's going to be a large percentage of jmp's/call's where this cannot be done. One or two nop's can be added, but that increases cog memory usage and reduces the advantage of the delayed jump. fjmp, fcall and fret would make it easier to use delayed jumps since only one instruction is needed after the jump. The main drawback to fjmp is that it must be unconditional, which will limit its use, but there are many cases where unconditional jumps are used.
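The delay-slot trade-off described above can be sketched as a simple count. The slot counts (three after a delayed jmp/call, one after the proposed fjmp) come from the post; the function name and the idea of counting each unfilled slot as one nop are illustrative assumptions, not a model of the actual P2 pipeline.

```python
# Count the nops needed to pad the delay slots after a delayed branch.
# Slot counts are taken from the post; everything else is a simplification.
DELAYED_JMP_SLOTS = 3  # instructions executed after a delayed jmp/call
FJMP_SLOTS = 1         # instructions executed after the proposed fjmp


def nops_needed(delay_slots, useful_instructions):
    """Return how many delay slots must be padded with nops."""
    if delay_slots < 0 or useful_instructions < 0:
        raise ValueError("counts must be non-negative")
    return max(delay_slots - useful_instructions, 0)


if __name__ == "__main__":
    print("useful  delayed-jmp nops  fjmp nops")
    for useful in range(4):
        print(f"{useful:6d}  {nops_needed(DELAYED_JMP_SLOTS, useful):16d}"
              f"  {nops_needed(FJMP_SLOTS, useful):9d}")
```

With only one instruction available to move after the branch, the delayed jmp still needs two nops in this count, while the fjmp form needs none.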

At this point I'm more interested in seeing a new FPGA image and documentation than adding hub slot sharing or fjmp. I guess that will happen once task swapping circuitry, static register remapping, SETINDA/B D, USB/CRC and SERDES are completed. However, I fear that one or two of those features might require weeks of discussion and hundreds of posts to the forum.

potatohead · 2014-03-03 08:22

Were we not going to enable Bill's HUNGRY COG mode?

After all the discussion, I thought that one was the "sweet spot" option.

Re: stalls

After reading your info Dave, I agree. What I really need to do is run your simulator some. But I have noticed sometimes-significant percentage differences in execution time on code that hops around some, without delayed branching. Hand optimization is similar to what we did on P1 with out of order and alignment with the HUB access windows, but it's a bit more involved now. In my case, I was fitting operations into a small blanking period in a video driver, non tasking mode. Just getting a feel for how the COG runs, before I go there.

cgracey · 2014-03-03 08:27

Dave Hein wrote: »

Yes, hub slot sharing is the main feature that I'm referring to. I think this is an extremely important feature, and I'm puzzled why it has fallen below in priority to thread swapping (or is it task swapping?). I think hub slot sharing would be beneficial in almost all applications. The percentage of applications that will use thread swapping would look like a line on a pie chart.

The other feature is the fjmp, fcall and fret instructions that I proposed. I can understand not looking into this feature immediately, since it hasn't really been investigated thoroughly. I'm not sure how many people have actually analyzed the pipeline stalls that their code generates, but I think it's a significant amount. The way to avoid pipeline stalls is to use a delayed jmp/call with three instructions after them that will be executed. In some cases this can be accommodated, but I think there's going to be a large percentage of jmp's/call's where this cannot be done. One or two nop's can be added, but that increases cog memory usage and reduces the advantage of the delayed jump. fjmp, fcall and fret would make it easier to use delayed jumps since only one instruction is needed after the jump. The main drawback to fjmp is that it must be unconditional, which will limit its use, but there are many cases where unconditional jumps are used.

At this point I'm more interested in seeing a new FPGA image and documentation than adding hub slot sharing or fjmp. I guess that will happen once task swapping circuitry, static register remapping, SETINDA/B D, USB/CRC and SERDES are completed. However, I fear that one or two of those features might require weeks of discussion and hundreds of posts to the forum.

Thanks for listing those items. Hub slot sharing will get reviewed before we are done. As for FJMP, we need to figure out what happens in a multitasking situation, since in single-tasking the jump occurs two cycles EARLY. I don't know if FJMP's behavior can be standardized in multitasking modes.

potatohead · 2014-03-03 08:30

Can we get an image just before USB / SERDES kick off? Seems to me, those will take considerable discussion, but maybe the others can just be completed with no further input. Maybe Chip can implement however he sees "best case" and we can all go for an upgraded "test drive"

cgracey · 2014-03-03 09:32

potatohead wrote: »

Can we get an image just before USB / SERDES kick off? Seems to me, those will take considerable discussion, but maybe the others can just be completed with no further input. Maybe Chip can implement however he sees "best case" and we can all go for an upgraded "test drive"

Absolutely. I'll get an update out before the USB/SERDES work commences.

Cluso99 · 2014-03-03 16:21

Heater. wrote: »

I don't know about the opium den. I always imagine the Prop II following forumistas more like this: http://www.youtube.com/watch?v=YG8yl_oLvzw

Chip is the quite relaxed one on the opposite side of the fence

hahaha - nice one heater

BTW, who's the forumista trying to jump the fence?
