Propeller II update - BLOG

T Chap · 2013-12-06 11:57

or DNU for do not use

MF1 mystery function 1

once the code is out, the code is out.

OBEX upload disallows any unaccepted words

Bill Henning · 2013-12-06 11:58

Good call. Rush is more recognizable to people than the Colossal Cave these days.

ctwardell wrote: »

How about in honor of Canada, and Rush of course, call it YYZ...

C.W.

David Betz · 2013-12-06 12:00

Bill Henning wrote: »

Good call. Rush is more recognizable to people than the Colossal Cave these days.

I know what the Colossal Cave is. What is Rush? (can you tell that I'm a geezer?)

Heater. · 2013-12-06 12:00

Chip,

...don't make it public, at all. Only tell the sort of people who really want to know, maybe requiring participation in some weekend retreat deep in the woods.

I do hope you are jesting. That would imply it's only used under NDA in closed source projects. Even so it would not be long before it was "outed". And runs counter to the spirit of openness you have going on here.

Still, we have plenty of forest just north of here if you'd like to come over and discuss it :)

Bill Henning · 2013-12-06 12:03

A famous Canadian rock band.

http://en.wikipedia.org/wiki/Rush_%28band%29

David Betz wrote: »

I know what the Colossal Cave is. What is Rush? (can you tell that I'm a geezer?)

Ym2413a · 2013-12-06 12:05

I'm not for the Slot Sharing idea at all.
I like the simple symmetry in the round-robin method of COG to HUBRAM.
I can just picture some sort of object code making it into slot 5, shared with a HUB-heavy video engine on slot 1, and then it stops working. With the loss of the Video DAC bus, that would just add one more level of confusion about which order to start COGs in.

Sometimes simpler is better.
Plus, I'm not into the idea of the added risk of adding such a feature last minute so that two or three people can use it.

Ym2413a · 2013-12-06 12:07

Bill Henning wrote: »

A famous Canadian rock band.

http://en.wikipedia.org/wiki/Rush_%28band%29

Isn't Rush Canada's biggest export? : ]

*joking*

ctwardell · 2013-12-06 12:09

Ym2413a wrote: »

Isn't Rush Canada's biggest export? : ]

*joking*

Them and Maple Syrup!

Heater. · 2013-12-06 12:10

T Chap

I really don't get this logic. Just add the features if it is easy to do, and ban objects with it included from OBEX.

Except this issue of object "mash ups" becoming unpredictable is not bounded by the OBEX. It's not bounded by Spin either.

What we want is for the P2 to be a wild success with code and projects sprouting up from wherever. You know, github overflowing with goodies to use.

They had better all work happily together.

Heater. · 2013-12-06 12:17

Chip,

Now you have done it.

If the PII comes out without that feature no one will be sure if it is in there or not. Except those who have been to the forest who know it is, if it is of course.

Either way the search will be on to find the missing feature. It's going to drive us all insane!

Rush has been around forever, who or what on Earth are Colossal Cave?

David Betz · 2013-12-06 12:29

David Betz wrote: »
Okay, I'll try a quick example:
    org $1000 ' hub mode address

    ' do some stuff
    LCALLA #my_hub_fcn
    ' do some more stuff
    HALT ' I know this isn't an instruction but I just wanted to indicate the end of this instruction sequence

my_hub_fcn
    ' do some stuff
    LCALLA #my_cog_fcn
    ' do some more stuff
    RETA

    org $0 ' COG mode address

my_cog_fcn
    ' do some stuff
    CALL #my_other_cog_fcn
    ' do some other stuff
    RETA

my_other_cog_fcn
    ' do some stuff
my_other_cog_fcn_ret
    RET
So you see that you can call either hub or COG functions with LCALLA and you can use RETA to return whether it's a COG mode or hub mode function. However, you can only use CALL/RET from COG mode code.

Note: The call from the main code to my_hub_fcn remains in hub mode so no mode transition happens. The call from my_hub_fcn to my_cog_fcn transitions from hub to COG mode. The call from my_cog_fcn to my_other_cog_fcn doesn't make any mode transition.

The RET in my_other_cog_fcn makes no mode transition since it is returning to COG mode code but the RETA in my_cog_fcn makes a mode transition from COG mode to hub mode on returning to my_hub_fcn. Lastly, there is no mode transition when returning from my_hub_fcn to the main code.

All of these transitions are made based on looking at bits 31:9 of the target address. If those bits are zero then the target is in COG mode. If they are non-zero, the target is in hub mode. Based on the current mode and the target mode, transitions happen if necessary. This all happens automatically.
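The address test described above can be sketched in a few lines of Python. This is only a model of the rule as stated in the post (bits 31:9 zero means COG mode), not a description of the actual silicon. The label addresses other than `$1000` and `$0` are made up for illustration, and `target_mode` and `needs_transition` are hypothetical helper names.

```python
COG_ADDRESS_BITS = 9  # COG addresses occupy bits 8:0, so bits 31:9 are zero


def target_mode(address: int) -> str:
    """Return "cog" when bits 31:9 of the address are zero, otherwise "hub"."""
    address &= 0xFFFFFFFF  # treat the address as a 32-bit value
    return "cog" if (address >> COG_ADDRESS_BITS) == 0 else "hub"


def needs_transition(current_mode: str, target_address: int) -> bool:
    """A mode switch is needed only when the target's mode differs from the current one."""
    return target_mode(target_address) != current_mode


# Hypothetical addresses for the labels in the example above.
MAIN = 0x1000              # hub mode code (org $1000)
MY_HUB_FCN = 0x1010        # hub mode function (assumed offset)
MY_COG_FCN = 0x000         # COG mode function (org $0)
MY_OTHER_COG_FCN = 0x004   # COG mode function (assumed offset)

call_chain = [
    ("main -> my_hub_fcn", target_mode(MAIN), MY_HUB_FCN),
    ("my_hub_fcn -> my_cog_fcn", target_mode(MY_HUB_FCN), MY_COG_FCN),
    ("my_cog_fcn -> my_other_cog_fcn", target_mode(MY_COG_FCN), MY_OTHER_COG_FCN),
]

for name, mode, target in call_chain:
    print(f"{name}: transition = {needs_transition(mode, target)}")
```

Only the middle call reports a transition, which matches the note above: hub to hub and COG to COG calls stay in the same mode.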

Sorry to bump my own message but I'm wondering if anyone has any comments on this?

jmg · 2013-12-06 12:31

cgracey wrote: »

- slots-
It was a long, drawn-out discussion. My thought at the moment is this: maybe implement it, but don't make it public, at all. Only tell the sort of people who really want to know, maybe requiring participation in some weekend retreat deep in the woods.

Not sure how much of that is tongue in cheek, but interrupts on normal micros fall into the 'handle with care' department.
Plenty of novices tangle themselves as they grasp interrupts.

One important principle, and easy to manage, with objects, is to always provide a commented working example.
Users then have a known reference point, and can move-off and merge from here, with the usual care and attention.

Slot-managing is not going to be in every design, but I recall looking at a shiny new 3 task uC a couple of years ago.
With 3 slots, I thought Great, I'll map 2 onto one core, and use the 3rd for low speed tasks.
Alas, that chip designer failed to allow slot controls, so I had only 3 time-shared cores/tasks, with 1 slot each.
Blew the time budget, and the design hit a brick wall. For the want of a couple of flip flops, in this case.

T Chap · 2013-12-06 12:32

There are objects in OBEX that load their own cogs behind the scenes, and a user finds out real quick that stuff is not working as expected. They dig around and find that they are unknowingly trying to run 10 cogs total, since they didn't thoroughly study all the objects to get a total of the cogs required. What makes slot borrowing any different? This idea of mandatory orthogonality seems way over the top.
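The cog-counting problem is easy to put into numbers. Here is a rough Python sketch of the tally a user would have to do by hand. The object names and per-object cog counts are invented for illustration; only the limit of eight cogs comes from the P1.

```python
COGS_AVAILABLE = 8  # the P1 has eight cogs


def cog_budget(objects: dict, available: int = COGS_AVAILABLE) -> tuple:
    """Return (total cogs requested, shortfall) for a set of objects."""
    total = sum(objects.values())
    return total, max(0, total - available)


# Hypothetical project: each entry is an object and the cogs it starts.
project = {
    "main program": 1,
    "sd card": 1,
    "vga": 2,
    "keyboard": 1,
    "mouse": 1,
    "serial": 1,
    "audio": 2,
    "sensor": 1,
}

total, shortfall = cog_budget(project)
print(f"requested {total} of {COGS_AVAILABLE} cogs, short by {shortfall}")
```

The point of the sketch is that the cog count is at least a single number per object that can be added up. Whether hub slot use could be documented as simply is what the rest of the thread argues about.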

cgracey · 2013-12-06 12:36

msrobots wrote: »

...Since fast ADC/DAC now has a binding between COG and pins, we cannot use 2 COGS to drive the same pins fast. A lot of video drivers use multiple cogs on P1. Is this still needed on P2?

I don't know. I think a single cog now has means to gather enough data to make any display on time. Other cogs can compute the display data, if necessary.

Only the streaming DAC output is cog-tied to pins. ADCs stream back via the IN signals, so any cog can process those.

Heater. · 2013-12-06 12:36

David,

I really have not been following how this is going to be made to work but that looks very fine to me from a user perspective.

Can we also, by symmetry, make jmps and/or calls to HUB functions from within COG functions?

Actually what are the limits? Looks like a COG function written to be called from HUB cannot be called from COG. And vice versa. That's probably not so bad.

jmg · 2013-12-06 12:40

Heater. wrote: »

What we want is for the P2 to be a wild success with code and projects sprouting up from wherever. You know, github overflowing with goodies to use.

They had better all work happily together.

- but this is a dream, as time+silicon is finite.

At some stage, you will always run out of COGS, Memory, pins, or time-slots

I see this similar to interrupts on standard micros, those also need care, and users can cut and paste only to a given level, before they have to grasp the trade offs.

The use of hub-slot sharing will be rare, but if user risk matters, then I have already suggested this:

* A lower-bandwidth, but 100% guaranteed, secondary slot system for those COGS that yield their slots to a faster COG.
That lower-bandwidth path is still faster than P1, and cannot be further impacted.

So there is a small silicon and implementation cost, but you can make this more robust.

David Betz · 2013-12-06 12:43

Heater. wrote: »

David,

I really have not been following how this is going to be made to work but that looks very fine to me from a user perspective.

Can we also, by symmetry, make jmps and/or calls to HUB functions from within COG functions?

Actually what are the limits? Looks like a COG function written to be called from HUB cannot be called from COG. And vice versa. That's probably not so bad.

Yes, I think that would be possible although you'd have to use LCALLA or LCALLB (or my proposed LCALLREG) not CALL since there is no way to write to the corresponding RET instruction in hub memory.
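For anyone wondering why CALL is tied to COG memory: on the P1, CALL stores the return address in the source field of the subroutine's RET instruction, which is why every subroutine needs a `_ret` label. The Python sketch below models that mechanism, assuming the P2 CALL keeps the same scheme (the `my_other_cog_fcn_ret` label in the example suggests it does). The addresses are hypothetical, and `RET_TEMPLATE` is just a stand-in for the upper bits of the instruction.

```python
COG_LONGS = 512            # COG RAM holds 512 longs, so an address is 9 bits
S_FIELD_MASK = 0x1FF       # source field: bits 8:0 of an instruction
RET_TEMPLATE = 0x5C7C0000  # stand-in for "JMP #0"; only the low 9 bits matter here


def call(cog_ram: list, pc: int, target: int, ret_slot: int) -> int:
    """P1-style CALL: patch the return address into the RET instruction, then jump."""
    return_address = (pc + 1) & S_FIELD_MASK
    cog_ram[ret_slot] = (cog_ram[ret_slot] & ~S_FIELD_MASK) | return_address
    return target  # new program counter


def ret(cog_ram: list, ret_slot: int) -> int:
    """RET jumps to whatever address CALL stored in its source field."""
    return cog_ram[ret_slot] & S_FIELD_MASK


cog_ram = [0] * COG_LONGS
cog_ram[0x020] = RET_TEMPLATE  # hypothetical location of my_other_cog_fcn_ret

pc = call(cog_ram, pc=0x005, target=0x010, ret_slot=0x020)
print(hex(pc), hex(ret(cog_ram, 0x020)))

# A hub return address does not survive the 9-bit field.
hub_return = 0x1004
print(hex(hub_return & S_FIELD_MASK))
```

So there are two problems for hub code: the RET instruction would have to be patched in hub memory, and a hub return address does not fit in the 9-bit source field anyway. A stack-based call such as LCALLA avoids both.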

Ym2413a · 2013-12-06 12:49

OBEX is what makes the propeller wonderful!
Grab an SD-card object, get one of the Video objects, get a sensor object. It allows you to focus more on whatever you're writing and not on all the little bit-banging matters. Plus there are a lot of smart people who are more than happy to share their well-written objects with everyone.

What makes the OBEX work is the symmetry of the Propeller: each cog is the same, has its own RAM, counters, resources, etc. So I'm highly against anything that puts the design out of balance.
I can picture some novice downloading a slot stealing SD-card object from a website and wondering why it stops working when they start their PS/2 object. They might think the problem is with the COG they just started and not what was already running... leading to countless hours of headaches and chasing their own tail.

The Propeller is designed to be balanced, simple and beautiful. If a car had 8 wheels, I wouldn't want one of them to be of a different size or shape. :]

Heater. · 2013-12-06 12:55

jmg,

...interrupts on normal micros fall into the 'handle with care' department. Plenty of novices tangle themselves as they grasp interrupts.

That is true. However it's not the software and timing tangle of interrupts that is analogous here. Two pieces of code that need low interrupt response latency will fail when combined no matter how smart the programmer is. Similarly if they require excessive processing in the handlers.

I have to repeat, this is not about novices "getting into a tangle" over HUB slot allocations. It's about people mixing and matching objects, which may well have been written by experts, and then finding the combination does not work. Now your novice fails in his endeavours and decides the Prop is Smile or he has to waste his time becoming a Prop guru in order to find out what's broken rather than getting on with his project.

There are objects in OBEX that load their own cogs behind the scenes, and a user finds out real quick that stuff is not working as expected.

Perhaps true. Such things should be documented in big red letters. I suspect HUB slot collision is a lot more subtle and intractable.

Seems to me, the Prop II design is already far exceeding original performance goals, especially if the "execute from HUB" gets in. Do we really need to break COG orthogonality to get just that little bit more?

Adding more surprises to Propeller behaviour is not exactly supporting the Principle of Least Surprise.

Ym2413a · 2013-12-06 13:00

Heater. wrote: »

jmg,

That is true. However it's not the software and timing tangle of interrupts that is analogous here. Two pieces of code that need low interrupt response latency will fail when combined no matter how smart the programmer is. Similarly if they require excessive processing in the handlers.

I have to repeat, this is not about novices "getting into a tangle" over HUB slot allocations. It's about people mixing and matching objects, which may well have been written by experts, and then finding the combination does not work. Now your novice fails in his endeavours and decides the Prop is Smile or he has to waste his time becoming a Prop guru rather than getting on with his project.

Perhaps true. Such things should be documented in big red letters. I suspect HUB slot collision is a lot more subtle and intractable.

Seems to me, the Prop II design is already far exceeding original performance goals, especially if the "execute from HUB" gets in. Do we really need to break COG orthogonality to get just that little bit more?

Adding more surprises to Propeller behaviour is not exactly supporting the Principle of Least Surprise.

You hit the nail on the head.
Least surprises is best.

And what makes the Propeller appealing to beginners and experts alike is the simple fact that you can mix and match objects for rapid development.
It's nice when things work on first try. ;]

Heater. · 2013-12-06 13:04

jmg,

- but this is a dream, as time+silicon is finite. At some stage, you will always run out of COGS, Memory, pins, or time-slots

Err... now you are being weird, I was not proposing that a Prop II be capable of running all the code on github at once

I see this similar to interrupts on standard micros..

BINGO, now you have it. That is the best argument against this wonky HUB slot timing idea.

potatohead · 2013-12-06 13:14

I can think of only one use case for multiple COGS driving Video DAC pins together for the same display, and that is the case where component video runs a monochrome PIN at high resolution, and another COG runs the two color difference pins at a lower resolution.

Obviously, multiple displays works just fine with the changes made. No worries, other than some pin placement considerations.

I had intended to write this for the case of moving video happening in a small memory foot print. If it never happens, it never happens. For lower resolution displays, this won't be optimal, but at HDTV resolutions, a very serious gain in memory footprint and with that fill rates could be had with very little overall quality perception loss.

Honestly, making the HUBEXEC feature work in a simple, robust way is worth doing the work. The chip will be much better positioned for design wins. Worth it. And it's a huge boost for large programs. The idea of larger programs on a "master COG" has been around since Bill thought LMM up.

You know, Xmos had complexity in this area and they offered timing tools, etc... to help people understand what can work with what. Seemed a big barrier to using the chip. There were other things mixed in with that complexity that more or less made a barrier to adoption high enough that only specialized cases made sense. Think Cable TV set top box type embedded.

I would prefer not to see this at all, leaving the robust scheme in place. That gets us, "works no matter what", and I think that's proven powerful on P1.

However, if it *has* to be done, the pairing is the next best thing. That gets us "works no matter what, given a few cases" and there are 4 slots possible for sure where that would be true.
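To make the pairing argument concrete, here is a small Python model of the hub rotation. The thread does not spell out how pairing would be wired, so this assumes cog n is paired with cog n + 4 and that a cog can hand its own slot to its partner. `partner`, `slot_owner` and `slots_per_rotation` are hypothetical names.

```python
NUM_COGS = 8  # one hub slot per cog per rotation


def partner(cog: int) -> int:
    """Assumed pairing: cog n is paired with cog n + 4 (mod 8)."""
    return (cog + NUM_COGS // 2) % NUM_COGS


def slot_owner(slot: int, donors: frozenset = frozenset()) -> int:
    """Return the cog that gets the given hub slot.

    A cog listed in `donors` hands its own slot to its partner.
    """
    cog = slot % NUM_COGS
    return partner(cog) if cog in donors else cog


def slots_per_rotation(donors: frozenset = frozenset()) -> list:
    """Count how many slots each cog receives in one rotation of the hub."""
    counts = [0] * NUM_COGS
    for slot in range(NUM_COGS):
        counts[slot_owner(slot, donors)] += 1
    return counts


plain = slots_per_rotation()
paired = slots_per_rotation(frozenset({4}))  # cog 4 gives its slot to cog 0
print("plain: ", plain)
print("paired:", paired)
```

In this model cog 0 doubles its hub bandwidth and cog 4 drops to zero, while every cog outside the pair still gets exactly one slot per rotation. That bounded effect is what lets pairing keep most of the "works no matter what" guarantee, unlike a general priority scheme.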

A general purpose slot sharing with priority, etc... means we can't ever say "works no matter what", and it's off to the development assistant tools and special cases get maximized over more general, robust cases. I'm not sure that aligns with the overall strength of the product. Sure, it's peak performance, given a lot of things, but maximizing that will cost the more general cases.

And here's the thing: This chip, with HUBEXEC, is going to rule a ton of general cases! If I were tasked with figuring out potential revenue, likely adopters, markets related, etc... I would have a very hard time justifying maximizing performance, unless those special cases were very solid design win potentials.

Are they? Seriously. I'm asking on a basic business level here. Take the geek hats off for a minute and think about what that trade-off means. It meant a lot for Xmos, despite a LOT of effort to do otherwise. I've seen examples in the past too. "Works no matter what" is extremely powerful, and overall, it's the clear winner in a lot of marketplaces. Best possible tech doesn't always win. Accessible, robust, consistent, low hassle tech wins a TON of cases.

So aside from the Philosophy, that is my beef with adding it. I know I could always write "works no matter what" compliant code, as could others. But I also know the first time I bump into a wall, I would look hard at the quick speed drug to power through it rather than refactor the problem in ways that are more parallel and deterministic.

Doing that is what leads to the nice, big, code bodies we've all come to know and love, and I just don't think it will happen as envisioned, simply due to everybody maximizing their case to get their stuff done, that's all.

Lastly, can you guys wanting this feature comment on slot sharing vs the general case wide open variable slots and the potential impact on the higher performance cases you envision? In other words, does it really have to be whole hog, or is there a very basic, common sense step that can be taken to get a net gain overall?

I want to get on board with that. Help me. I'm flat out opposed to the general case, but the pairing is something I'm flirting with because we could do it and still say, "works no matter what" with very few qualifiers, which again I consider EXTREMELY important to the overall positioning of this product for high adoption rates, which it seriously needs to do.

I see this similar to interrupts on standard micros..

BINGO, now you have it. That is the best argument against this wonky HUB slot timing idea.

Agreed. My thoughts exactly.

T Chap · 2013-12-06 13:14

Well, here is an idea: how about nobody posts any other suggestions/improvements (however good) that may cause non-orthogonality.

edit: added smiley

In general practice, I am daily wanting more cogs, more counters, more ram, more spin speed. Even when the P2 comes out, we will all be trying to get it to do more than it was "designed" to do.

potatohead · 2013-12-06 13:22

The use of hub-slot sharing will be rare,

I disagree. I think it will be near constant. Boosting speed the easy way will be the very first thing many people do in order to avoid the more painful problem and code refactoring needed to better utilize the chip. It's too easy.

In fact, one case that will happen right away is the super fast COG running a big program, then suddenly the other COGS aren't so useful... And then people will be thinking, "if only I could interrupt this super cog a little", etc... and those discussions go all over the place, many with no good, simple ends. More importantly, the expectation that 8 powerful cores are in there goes off the table, with people lamenting that all the COGS don't go that fast, and so on....

I think a lot of those cases are a tease. Really, a more powerful CPU should be used, with the P2 offering up its parallelism and real time attributes. Or, the problem should be refactored to parallelize better. Pointing most of the power at a COG or two dilutes the strengths without actually offering up as many advantages. Non optimal.

Given the current scheme, it's not a dream. Everybody will work on the best possible code, and the stuff that floats up to the top will be useful because it works no matter what. If that's not in the thing, everybody will maximize their case, leaving little "works no matter what" code out there.

And that may explain why I'm open to sharing. At least that case has a solid boundary. Minimum "works no matter what multi-core" expectations can be set, for 4 cores in "turbo mode". This is easy. Still non optimal, but I could be sold on it. Sell me.

Bill Henning · 2013-12-06 13:27

http://en.wikipedia.org/wiki/Rush_%28band%29

Heater. wrote: »

Rush has been around forever, who or what on Earth are Colossal Cave?

ctwardell · 2013-12-06 13:27

I don't think a decision should be made on Hub Slot Sharing until the SERDES/CRC is done.
A lot of push for the SERDES/CRC is based on getting a good USB implementation, so if it turns out that Hub Slot Sharing is also needed to support the desired USB modes, then it's in, otherwise it can be left out.

If it is included I'm fine with pairing.

C.W.

David Betz · 2013-12-06 13:32

Heater. wrote: »

Rush has been around forever, who or what on Earth are Colossal Cave?

You're kidding right? Colossal Cave, otherwise known as Adventure, was the original text adventure game. I used to be a big fan of text adventures. I even created a simple programming language called AdvSys for creating them.

And, I guess I have heard of Rush the rock band. I just can't say I can recall any of their songs.

potatohead · 2013-12-06 13:39

This is too funny!

I would have expected most of you to get those references.

RUSH = Gods own Band! --> Quote from more than a few fans I've met over the years. I regularly listen to "A Show Of Hands" during long computer sessions. Never gets old.

jmg · 2013-12-06 13:41

Heater. wrote: »

jmg,

Err... now you are being weird, I was not proposing that a Prop II be capable of running all the code on github at once

My point was that there are always going to be resource-ceiling issues, even without extra features.
Extra features are not to be feared.

Heater. wrote: »

BINGO, now you have it. That is the best argument against this wonky HUB slot timing idea.

And yet those terrible, user-biting devices sell in HUGE volumes that the Prop can only dream about?
Again, the point is to let the silicon be as powerful as it can, and it will sell. Users are not as dumb as you imagine.

To limit it, on some 'Protect the novices' mantra, is frankly quite self-defeating.
Parallax semiconductor exists to sell devices.

Heater. · 2013-12-06 13:50

potatohead,

All well said, and surprisingly concisely coming from you

Can anybody make a compelling case for this wonky HUB timing idea?
How much extra performance are they expecting?
What use cases does that cover that can't be covered without it?
Do those use cases constitute a notable increase in PII marketability and possible sales?

My gut tells me: not much, not many and only a tiny amount. And that the price for this is high.

I could be wrong of course, so let's hear it.

I try not to mention the X word much here, but as you brought it up, let's talk XMOS devices.

They are brilliant. They are fast. They are deterministic. They have multiple cores and hardware scheduled threads. They have a ton of I/O. They have high speed communication channels between cores and threads. They have SERDES and clocked parallel I/O support.

What's not to like:

1) In order to make use of the many features, channels for example, you have to work in XC, a mutant of the C language.
2) You cannot access any pin from any core. Each core has its own set of pins.
3) You cannot access all of RAM from a core. Each core has its own RAM space.
4) You cannot use any pin any how you like. Pins have to be configured into "ports" of various widths: 2, 4, 8, 16 bits.
5) In each "port" all pins are either input or output.
6) You cannot use any random pins to form a port; there is a very limited set of possible port configurations. Want an 8 bit port? You may lose the possibility of that two bit port you wanted.
7) No analog I/O.

I could go on.

As Potatohead says, these things make the chip inflexible and hard to use.

I don't know if this XMOS talk is relevant here but it's a look at a direction we may not want to go in at least.
