Propeller II update - BLOG - Page 159 — Parallax Forums

Propeller II update - BLOG


Comments

  • Heater. Posts: 21,230
    edited 2014-01-16 10:40
    potatohead,
    ... unless I see additional complexity being advocated. We really don't need it.
    I second that. Whatever happened to the KISS principle? This P2 is far too complex already.
  • Bill Henning Posts: 6,445
    edited 2014-01-16 10:44
    DISABLEHUNGRY ' default, turns off HUNGRY
    ENABLEHUNGRY n ' hub instruction, enables hungry for cog n exclusively

    Enforces single hungry cog rule for simplicity.

    Prop defaults to no cog being hungry.

    Enforces only one cog can be hungry at once.

    Simple.

    Still gets the extra speed for the "main app" / "SuperCog"
    Heater. wrote: »
    It is not possible to enforce that "only one COG hungry rule".

    I'm going to assume that my language provides a function/method/macro/intrinsic to allow me to use whatever hungry instructions I need, like we have for many other hardware-specific features.

    If I write a function that contains a "HUNGRY" instruction the compiler cannot tell which COG will end up running that function. Or how many running COGS will call the function that contains that "HUNGRY" statement. Or actually if that HUNGRY statement ever gets executed in the whole execution life of the program.

    Static analysis by a compiler can not determine if that HUNGRY is used, or used more than once, at run time.

    Such a rule would require hardware support. Think of it like claiming a lock. And please God let's not go there.
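
    To make that concrete, here is a minimal C sketch. hungry_mode() and start_on_free_cog() are hypothetical stand-ins, stubbed so the fragment compiles; they are not real propgcc calls. Nothing in this source tells a compiler which COG, or how many COGs, will ever reach the HUNGRY:

    /* Conceptual sketch only: hungry_mode() and start_on_free_cog() are
     * hypothetical stand-ins, stubbed here so the example compiles.    */
    #include <stdio.h>
    #include <stdlib.h>

    static void hungry_mode(void) { /* would execute the HUNGRY instruction */ }
    static void start_on_free_cog(void (*fn)(void *), void *arg) { fn(arg); }

    static char buf[64];

    /* Any COG that calls this becomes hungry. The compiler cannot know
     * which COG that is, how many COGs will call it, or whether the
     * call is ever reached at run time.                                */
    static void fast_fill(void *arg)
    {
        (void)arg;
        if (rand() & 1)              /* data-dependent: may never execute */
            hungry_mode();
        for (int i = 0; i < 64; i++)
            buf[i] = 0;
    }

    int main(void)
    {
        int n = rand() % 8;          /* number of workers decided at run time */
        for (int i = 0; i < n; i++)
            start_on_free_cog(fast_fill, NULL);
        printf("started %d workers\n", n);
        return 0;
    }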
  • Roy Eltham Posts: 2,996
    edited 2014-01-16 10:50
    If there is going to be a "hungry mode" for COGs that uses unused HUB slots, then having the compiler limit the feature of the chip arbitrarily seems stupid. I can't see that being a reasonable part of any "solution". Why limit it to only one? What if you are only running 2 COGs? Why not let them all run in hungry mode? The same could be said for 3 or 4, or more: as long as you have either unused COGs or COGs not using the hub, why not let the other COGs use the free slots?

    If Chip decides to implement a "hungry mode" for COGs then he'll likely do it in a way that makes sense and won't need the compiler to arbitrarily limit it.

    I was against the idea initially, and I am still against any complicated sharing/donating/cooperative modes thing where COGs need to "give up" their slots to other COGs. I think Bill's super simple thing where a COG can be configured to attempt to use any free slots is acceptable. Although, I think it has more uses than he says, because I can imagine it being very beneficial for doing things like capturing images from high speed camera modules and dumping the results into hub, or making a super version of the propscope. It does have the issue that if you write a piece of code that relies on getting free extra slots and you put it together with 7 other pieces of code that consume all the hub slots for their COGs, then it won't work properly. This is a pretty unlikely case, especially since all 8 COGs consuming all HUB slots fully would not be satisfied by 256K of memory in all but weird contrived cases (that I can think of).

    In practice, the "concern" over OBEX objects not working when mixed and matched is very unlikely to ever be problem.

    My main worry over this feature right now is that it doesn't seem as easy to do and make actually work all around as some people are hinting at, so it could take a long time and be risky. Chip would need to answer that. I'm sure it's different now than it was before, because of all the changes to COG/HUB interactions with the widening to 8 from 4 LONGs, the 4 cache lines being added, and new instructions for driving those. We'll see...
  • ctwardell Posts: 1,716
    edited 2014-01-16 11:40
    I don't agree with limiting HUNGRY to one cog.

    Keep in mind that not only will a HUNGRY cog not always use available shared slots, it may also not always use its own slots.

    It's only going to use the shared slot if it happened to have a hub op pending, and in some cases won't need a hub op during its own slot because it just got what it needed using the shared slot.

    C.W.
  • jmg Posts: 15,155
    edited 2014-01-16 11:48
    Roy Eltham wrote: »
    I was against the idea initially, and I am still against any complicated sharing/donating/cooperative modes thing where COGs need to "give up" their slots to other COGs. I think Bill's super simple thing where a COG can be configured to attempt to use any free slots is acceptable. Although, I think it has more uses than he says, because I can imagine it being very beneficial for doing things like capturing images from high speed camera modules and dumping the results into hub, or making a super version of the propscope.

    Correct - the upside to this feature is large... and it gives the designer control.
    Roy Eltham wrote: »
    My main worry over this feature right now is that it doesn't seem as easy to do and make actually work all around as some people are hinting at, so it could take a long time and be risky. Chip would need to answer that. ..

    Chip did seem to have a method pretty much all mapped out, that seemed simple to me.
  • Roy Eltham Posts: 2,996
    edited 2014-01-16 11:52
    The discussions I remember with Chip on this were more about the design of the feature, not the actual implementation, and at the time it was before HUB slots were widened to 8 longs, hub exec mode, and the cache lines for that. So it may be more complex now to make it happen.
  • potatohead Posts: 10,260
    edited 2014-01-16 12:42
    Exactly Roy. We have not really seen anything from him as of late. We may find it is a PITA still. Nobody knows.

    @Jmg How much propeller programming have you done? I see things like "correct" and "control" written from some position of authority. I'm asking because the rest of us have been involved with these chips, programming, hardware, etc... since near, if not day one. A brief look at your posts reveals little but your comments on what a P2 should look like.

    To be clear, those are not unwelcome; however, the implication that you have it all figured out is a bit presumptuous, unless you have been doing development.

    Additionally, many of us have specific plans for the P2, again largely based on our very favorable P1 experiences. Is that the case for you here perhaps?

    Just wondering. I really don't mind if you don't have either going on, but I really do mind the implications I've mentioned without some basis.

    Maximum control is not always a good thing.
  • Cluso99 Posts: 18,069
    edited 2014-01-16 15:18
    Ponder this for a moment...

    All cogs are set to hungry (I dislike his term).

    What is the result?

    My expectation is that all cogs would benefit. As each cog requires a slot, it would receive the next available slot, which is likely to be before its own slot came around. Most likely this would mean that it would now not require its own slot, making it available to another cog, thus improving that cog's performance too.

    The result would be like slots being allocated on a first come, first served basis. All cogs would see a performance increase.

    BTW Please don't dismiss the "donate" slots between co-operating cogs, as this can be done deterministically and provides significant performance increases.
  • bartgrantham Posts: 83
    edited 2014-01-16 15:31
    (long time lurker but this conversation is too compelling to stay quiet)

    If one wants to argue for a cog being able to utilize unused hub access slots I'm not sure what the point of proposing a "hungry" mode is. It seems to me that if you want to make hub access fluid then pending hub read/writes by cogs should be governed by LRU. In other words, all cogs are hungry. Or none are. Depends on if they want access to the hub RAM. If all cogs want access, they end up time slicing by 8. If only two do, they access the hub back and forth. If it's just one 99% of the time, then it gets 99% bandwidth.
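
    As a toy illustration of that bandwidth behaviour, here is a little C model of "grant each slot to the next COG that actually wants it". It has nothing to do with how the real hub arbiter is (or would be) built:

    /* Toy model: each hub clock, grant the slot to the next COG (in
     * round-robin order) that has a transfer pending.  Busy COGs always
     * have one pending; idle COGs never do.  Illustrative only.        */
    #include <stdio.h>

    #define COGS   8
    #define CLOCKS 8000

    static void run(int busy_cogs)
    {
        int pending[COGS] = {0}, served[COGS] = {0}, next = 0;

        for (int clk = 0; clk < CLOCKS; clk++) {
            for (int c = 0; c < busy_cogs; c++)
                pending[c] = 1;                 /* busy COGs always want the hub */
            for (int i = 0; i < COGS; i++) {
                int c = (next + i) % COGS;      /* next waiting COG gets the slot */
                if (pending[c]) {
                    pending[c] = 0;
                    served[c]++;
                    next = c + 1;
                    break;
                }
            }
        }
        printf("%d busy COG(s): COG0 got %.1f%% of the slots\n",
               busy_cogs, 100.0 * served[0] / CLOCKS);
    }

    int main(void)
    {
        run(1);   /* lone COG: ~100% of the bandwidth   */
        run(2);   /* two COGs: back and forth, 50% each */
        run(8);   /* all busy: 1 in 8, same as today    */
        return 0;
    }

    Note the guaranteed floor is still 1/8: everything above that only exists while the other COGs stay idle.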

    This is a generalization/restatement of the idea, which feels much more natural to me as a developer. Unfortunately it throws a huge monkey wrench into a prime design criterion, which is simplicity and determinism of timing for cogs, and it probably represents a huge amount of redesign for Chip.

    Heater's concern (with which I agree!) is that people will start building/publishing/sharing cog programs that require >1/8 hub bandwidth when 1/8 is the only guaranteed amount of bandwidth they can rely on, making module/object compatibility a complex issue.

    Bill's concern is that we're leaving performance "on the table", which is doubly important for key bandwidth-constrained uses such as VM's, Spin, video kernels, etc. (which I also agree with!).

    It seems like satisfying both is something of a cultural issue (OBEX only accepting non-hungry objects, for example). But are there any other solutions?
  • mindrobots Posts: 6,506
    edited 2014-01-16 15:32
    Will it become a "hungry" free for all with everybody getting in worst case a 1 in 8 chance at a slot instead of an orderly round-robbin with a possible 7 slot stall if you miss??
  • Tubular Posts: 4,646
    edited 2014-01-16 16:28
    bartgrantham wrote: »
    It seems like satisfying both is something of a cultural issue (OBEX only accepting non-hungry objects, for example). But are there any other solutions?

    Well said, Bart. Many of us are a bit too involved/fatigued/both to recognize there are additional dimensions to this, certainly cultural.

    I'd like to throw the "longevity" dimension in there too. When we're (up to) 7 years hence and thrashing out the P3 details, but still making do with a dated P2 that needs to generate the cash for P3, we may really be wishing some of these extended features had made the P2 cut.

    I think Chip has proven he has a good sense of what is sufficiently elegant to make the cut. Many of the "advanced features", such as tasks, counters, video generation, waveform generation, data transfers "get out of the way" quite beautifully and won't burden or scare newbies. It doesn't sound like we're "quite there" with hub design, but there will be an elegant solution to this, too.

    This whole P2 thing is quite an exercise in faith (I'm sure Ken would agree) but I think we're in good hands.
  • jmg Posts: 15,155
    edited 2014-01-16 16:37
    bartgrantham wrote: »
    It seems like satisfying both is something of a cultural issue (OBEX only accepting non-hungry objects, for example). But are there any other solutions?

    From his comments above, Chip already has a slot-sharing scheme ready to implement, but if you wanted to cast a wider net for solutions, there is also an existing/proven method of controlling the time-slices used by the threads inside a COG.

    It could be that the same slot-allocation scheme could be used across COGs; that would need a global register, which would init to assign a HUB-slot density to each COG.
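
    To picture it, here is a sketch of such a table, in the same spirit as SETTASK's time-slice fields inside a COG. The 4-bit field width, 16-slot length and the values are invented for illustration; this is not Chip's scheme:

    /* Sketch only: a global table saying which COG owns each hub slot.
     * The 4-bit field width, 16-slot length and the values below are
     * invented for illustration.                                       */
    #include <stdio.h>
    #include <stdint.h>

    #define SLOTS 16                  /* table repeats every 16 hub slots */

    static uint64_t slot_table;       /* 16 x 4-bit fields: owning COG per slot */

    static int cog_for_slot(int slot)
    {
        return (int)(slot_table >> (4 * (slot % SLOTS))) & 0xF;
    }

    int main(void)
    {
        slot_table = 0x7654321076543210ULL;   /* reset: slot s -> COG s % 8 */

        /* reallocate: COG 0 gets every even slot (denser hub access);
         * the odd slots are shared out among COGs 1..7 (and 1 again)   */
        slot_table = 0x1070605040302010ULL;

        for (int s = 0; s < SLOTS; s++)
            printf("slot %2d -> COG %d\n", s, cog_for_slot(s));
        return 0;
    }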
  • Ariba Posts: 2,685
    edited 2014-01-16 16:43
    bartgrantham wrote: »
    Bill's concern is that we're leaving performance "on the table", which is doubly important for key bandwidth-constrained uses such as VM's, Spin, video kernels, etc. (which I also agree with!).
    ... But are there any other solutions?

    I think there are other solutions: Don't have the code, the variables and the stack all in hubram. With the Prop2 you can get very high performance by executing big code from hubram, holding the variables in cogram and the stack in auxram (formerly known as stackram).

    This has its limits but the Prop2 is still a microcontroller and not made for Linux or something like that.

    Andy
  • rogloh Posts: 5,281
    edited 2014-01-16 17:37
    Ariba wrote: »
    With the Prop2 you can get very high performance by executing big code from hubram, holding the variables in cogram and the stack in auxram (formerly known as stackram).
    Andy

    You don't know just how much I am looking forward to being able to use this type of model. On P1 I've spent countless hours chasing down every last opportunity to find weird tricks and restructure code to save a Long here or there in the COG, either to fix a bug or to add in some new feature (especially with various hi-res video drivers, which need to store whole lines of video data). While initially it can be quite an interesting and challenging mental puzzle to solve, after you do it enough times it gets a bit tedious, and the resulting code often becomes quite tricky to follow whenever you come back to it later. Lots of cool P1 drivers I found (e.g. Audio COGs, KBD/mouse) are jam-packed in the COG to maximize its capabilities, but this makes them difficult to extend.

    With hub mode we now have the luxury of so much more room for running larger amounts of PASM code at speed, and once you don't need to execute RDLONG/WRLONGs etc. from PASM you should get pretty high performance, with all the COG's dedicated hub access slots being available for the instruction prefetcher, and this is even without any hungry mode which, if implemented, could only help further. I also suspect that with some effort you could run PASM in hub mode deterministically if you are always aware of particular jump positions, what will already be cached, and the alignment of the PASM (a little like you needed to be on P1 with variable hub read/write timings and the hub window boundary). Turning the caching off may actually help figure out how to make your code deterministic, but the code will obviously then run slower, with all jumps out of the current octal row triggering another read from the hub. For some applications that may be entirely acceptable. Not everything is going to need the full 200 MIPS.

    This feature is going to open up so much more and I can't wait to use it!
  • bartgrantham Posts: 83
    edited 2014-01-16 19:03
    Ariba wrote: »
    I think there are other solutions: Don't have the code, the variables and the stack all in hubram.

    Yes, and while I was at first dismayed when I heard about hubexec, I’m now excited about the flexibility offered between pure-cog, hybrid cog/hub, and pure-hub execution models.

    FWIW, I had an instinctive knee-jerk reaction when reading Bill’s arguments for a cog being able to access the hub in other cogs’ unused hub windows, but his point about performance speaks very, very strongly. Being able to run an interpreter at up to 8 times faster through this one change is worth discussing in greater detail!

    If it helps, here’s the mental calculus that changed my mind on this. Before, the contract that was made for the cog programmer was this:

    “Initial hub access will take between 1 and 8 cycles, after which you can rely on every access thereafter taking exactly 8 cycles.”

    With hungry mode (I prefer “LRU hub access”) the contract would be:

    “Hub access will take between 1 and 8 cycles. No guarantees are made about the timing of subsequent access except that it will take 8 or fewer cycles.”

    (side note: this also means that one can no longer rely on interleaved hub memory access for tight parallelization)

    Sounds like a strong technical win. But there’s a kind of engineering and design discipline that has developed under the bandwidth-constrained P1 that I think most people would like to preserve, leading to fears of “SuperCog” objects. That’s why I suggest it’s a cultural issue. If it’s possible to implement in time, the question to me is how to clearly and loudly broadcast the message that cogs that access more than 1/8th hub bandwidth are playing with fire.
    Ariba wrote: »
    This has its limits but the Prop2 is still a microcontroller and not made for Linux or something like that.

    I'm sure I'm not the only one who is daydreaming of building a general purpose computer with Prop2 at its heart with a full-blown new OS, applications, the works. There isn't much that a P2 can't do that any other processor can; it just does it differently in many cases. The biggest missing piece appears to be an MMU, but even memory protection can be ... negotiable.
  • potatohead Posts: 10,260
    edited 2014-01-16 19:51
    Roy Eltham wrote: »
    I was against the idea initially, and I am still against any complicated sharing/donating/cooperative modes thing where COGs need to "give up" their slots to other COGs. I think Bill's super simple thing where a COG can be configured to attempt to use any free slots is acceptable.

    Yeah, me too. Let's do it just as simple as that. A COG can be hungry, or a moocher, or whatever term we assign to it, and have a "no" switch for testing. That's it. Honestly, we don't need any more to get the performance benefits.

    Re: 8x faster interpreter

    Well, that's one of those expectations things playing out right there. Yes, maybe if we don't have the other COGS actually busy in the HUB, but we shall see.

    Re: modes, donate, etc...

    I'm opposed and would rather see nothing than a mess.
    bartgrantham wrote: »
    I'm sure I'm not the only one who is daydreaming of building a general purpose computer with Prop2 at its heart with a full-blown new OS, applications, the works.

    And another one of those expectations things playing out. I think nice little hobby computers can be made with this chip. More than that seems completely unrealistic to me personally.
  • bartgrantham Posts: 83
    edited 2014-01-16 23:10
    potatohead wrote: »
    I think nice little hobby computers can be made with this chip. More than that seems completely unrealistic to me personally.

    Oh sure. I was thinking about something like an Amiga 500.
  • cgracey Posts: 14,133
    edited 2014-01-17 01:24
    Thanks for thinking about this slot sharing issue, Everyone. I don't have any definite thoughts on it yet, myself, but I'll get there soon.

    I'm getting the hub stack CALLA/CALLB/RETA/RETB instructions implemented now and I hope to have them done tomorrow. There are a few other, more minor things to finish implementing after that to round out hub execution: some instruction to take a 16-bit relative hub address and make it into an absolute 18-bit address that can be used as a data pointer, and then the AUGI instruction which affords 32-bit constants in-line.
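
    (For illustration only, here is one plausible reading of those two operations written as C helpers. The byte-granular signed offset and the 23/9 bit split are assumptions, not the actual encodings.)

    /* Illustration only: one plausible reading of the two operations
     * described above.  The signed, byte-granular offset and the 23/9
     * bit split are assumptions, not the actual P2 encodings.          */
    #include <stdio.h>
    #include <stdint.h>

    /* widen a 16-bit PC-relative value to an absolute 18-bit hub
     * address (256KB of hub RAM)                                       */
    static uint32_t rel16_to_abs18(uint32_t pc, uint16_t rel)
    {
        return (pc + (uint32_t)(int32_t)(int16_t)rel) & 0x3FFFF;
    }

    /* AUGI-style prefix: a previous instruction supplies the upper bits
     * of the next instruction's short immediate, forming a 32-bit
     * constant in-line                                                 */
    static uint32_t augment(uint32_t aug_payload23, uint32_t imm9)
    {
        return (aug_payload23 << 9) | (imm9 & 0x1FF);
    }

    int main(void)
    {
        printf("%05X\n", rel16_to_abs18(0x01000, (uint16_t)-4)); /* 00FFC    */
        printf("%08X\n", augment(0x123456, 0x1AB));              /* 2468ADAB */
        return 0;
    }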
  • David Betz Posts: 14,514
    edited 2014-01-17 03:51
    cgracey wrote: »
    Thanks for thinking about this slot sharing issue, Everyone. I don't have any definite thoughts on it yet, myself, but I'll get there soon.

    I'm getting the hub stack CALLA/CALLB/RETA/RETB instructions implemented now and I hope to have them done tomorrow. There are a few other, more minor things to finish implementing after that to round out hub execution: some instruction to take a 16-bit relative hub address and make it into an absolute 18-bit address that can be used as a data pointer, and then the AUGI instruction which affords 32-bit constants in-line.

    Sounds like you're getting close. Congratulations! It's been a long haul but very interesting to watch from the sidelines. Dare I ask whatever happened to the idea of a CALL instruction variant that places its return address in a known COG location? Is that in there already or still planned?
  • Heater. Posts: 21,230
    edited 2014-01-17 04:36
    What is the current size of the P2 instruction set?

    I'm kind of worried because it was huge the last time I took a peek and it seems to be growing and growing.

    Is anyone expected to be able to keep all that in their head and make sensible use of it?

    Certainly compilers will only use a small subset of common instruction types and perhaps offer intrinsics/functions/macros for some odd instructions that are generally useful.
  • mindrobots Posts: 6,506
    edited 2014-01-17 06:04
    I counted 312 mnemonics in the latest reference I have, but it doesn't have the HUBEXEC codes in there or AUGI (you have to like an instruction set that has one named AUGI!). A lot of them are 'A' or 'B' variants or '1','2','3','4' variants, so it really isn't that large. Once they are grouped into functional families, it will be an even smaller list. It has gone past simple-and-elegant and grown to large-and-hopefully-still-elegant.

    There will be the commonly used instructions for both PASM coders and compiler writers, which will intersect. There will be a group of really cool P2 instructions that people jump on and commonly use, and there will be special feature instructions that you'll see used very rarely and that may get forgotten by the common coder.

    Can it all be kept in your head? No, at least not *THIS* head! :smile:

    We'll need a PASM IDE with code completion and some mnemonic lookup features!
  • David Betz Posts: 14,514
    edited 2014-01-17 06:18
    Seems like even so-called RISC processors have huge instruction sets these days. Look at the instruction set for the ARM. This is one reason I wanted to play with the P1 RTL. I wanted to see how little I could add to it and still come up with a powerful processor. The existing P1 already has the simplicity and elegance feature.

    Edit: And, I didn't mean to imply that P1 wasn't already a powerful processor. I just always wanted more code space and I wonder if the hub exec idea could be added to P1 without some of the other changes that greatly increased the complexity of P2. Also, I'm not saying that the P2 complexity is bad. It's just that I'm a minimalist at heart and I'd like to see how little could be added to P1 and still result in a significant leap in code capacity and performance.
  • Heater. Posts: 21,230
    edited 2014-01-17 06:48
    I haven't looked in detail, but didn't ARM go mad with the "thumb" instruction set to produce smaller code? Then they threw in hardware execution of Java byte codes (Does anyone even use that?). Then they have half a dozen different floating point implementations going on and different generations of SIMD instructions. Chaos.

    I guess the 300-odd instructions on the P2 come down to 150 or so different operations when we take into account A/B variants that can easily be memorized. That is still a lot though.

    Now what about all those execution modes:
    1) In COG
    2) In HUB
    3) Single threaded
    4) Hardware threaded
    5) Stack in HUB?
    6) Stack in AUX
    7) Greedy or not?

    That's a lot to keep track of. And what instructions can I use in what modes and how does everything interact with everything else?

    How many of these "modes" will, say, propgcc ever support.

    I'm just feeling overwhelmed at the moment. Tell me it's just a case of not being able to see the forest for the trees.
  • David Betz Posts: 14,514
    edited 2014-01-17 07:00
    Heater. wrote: »
    4) Hardware threaded
    This is the one that worries me a bit because the last time I followed the discussion of hardware tasks there were restrictions on which instructions could be used and also that any hub instruction could stall the entire pipeline including the other hardware tasks. I'm not sure if that is still true in the current design.
  • Seairth Posts: 2,474
    edited 2014-01-17 07:38
    Heater. wrote: »
    I'm just feeling overwhelmed at the moment. Tell me it's just a case of not being able to see the forest for the trees.

    No, I think your feelings are legitimate. I've been arguing this for months (even before the addition of some of these features). And, for better or worse, there's no "going back".
  • Seairth Posts: 2,474
    edited 2014-01-17 07:42
    David Betz wrote: »
    This is the one that worries me a bit because the last time I followed the discussion of hardware tasks there were restrictions on which instructions could be used and also that any hub instruction could stall the entire pipeline including the other hardware tasks. I'm not sure if that is still true in the current design.

    I believe that the normally-blocking instructions will instead self-jump (to avoid a pipeline stall) when in multitasking mode.
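
    As I understand it, the effect is roughly what this toy C model shows. It is only the scheduling idea, not the real pipeline:

    /* Toy model of "self-jump" vs "stall": four hardware tasks share one
     * pipeline round-robin.  Task 0 issues a hub op that cannot complete
     * until the COG's hub slot arrives; with self-jump it simply retries
     * on its own turns while tasks 1..3 keep executing.  Illustrative only. */
    #include <stdio.h>

    int main(void)
    {
        int executed[4] = {0};
        int hub_pending = 1;                  /* task 0 is waiting on the hub */

        for (int clk = 0; clk < 32; clk++) {
            int task = clk % 4;               /* tasks rotate every clock     */
            int hub_ready = (clk >= 6);       /* this COG's hub slot at clk 6 */

            if (task == 0 && hub_pending) {
                if (hub_ready)
                    hub_pending = 0;          /* the RDLONG finally completes */
                /* else: the instruction "jumps to itself"; task 0 retries on
                 * its next turn and the pipeline itself never stalls         */
            } else {
                executed[task]++;             /* tasks 1..3 never notice      */
            }
        }
        for (int t = 0; t < 4; t++)
            printf("task %d executed %d instructions\n", t, executed[t]);
        return 0;
    }

    With a blocking hub op the whole pipeline would have sat still until the hub slot arrived, costing the other tasks those clocks as well.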
  • David Betz Posts: 14,514
    edited 2014-01-17 08:04
    Seairth wrote: »
    I believe that the normally-blocking instructions will instead self-jump (to avoid a pipeline stall) when in multitasking mode.
    Does that include rdlong, etc?
  • Seairth Posts: 2,474
    edited 2014-01-17 08:11
    David Betz wrote: »
    Does that include rdlong, etc?

    I believe so. Though, I have no idea how that will complicate things with the new caching mechanism for hubexec mode.
  • potatohead Posts: 10,260
    edited 2014-01-17 08:40
    Personally, I'm going to have to take this in layers. Find the reasonable set of "get it done anyway" instructions, work with those, get stuff running. Then, under that basic "infrastructure", expand out.

    We are going to need to document and we are going to need to share the basics for it all to make sense.

    Having a lot of instructions is a better case than having not enough. On the P1, we saw a core set get used all the time, then some others not so much. It's the same for P2.

    Re: lots of modes.

    Yeah. My thoughts too. My only response is we helped make this mess, lol. So, we need to be there to make the most of it! :)
  • ctwardell Posts: 1,716
    edited 2014-01-17 08:54
    Seairth wrote: »
    I believe that the normally-blocking instructions will instead self-jump (to avoid a pipeline stall) when in multitasking mode.

    Do we have any detail on that? I seem to recall seeing that as well, but can't find it.

    I'm wondering how that would work; it seems like it would be possible that, depending on how the tasks are set up, it might never get hub access.

    C.W.