
POLL: How much hub ram and how much cog ram would you prefer?

Cluso99 Posts: 18,069
edited 2014-05-16 23:09 in Propeller 2
Inspired by Chip & Roy's New Hub Scheme
http://forums.parallax.com/showthread.php/155675-New-Hub-Scheme-For-Next-Chip
I worked out a way to utilise additional cog RAM above the 2KB (512-long) limit and run code from it at full speed:
http://forums.parallax.com/showthread.php/155687-A-solution-to-increasing-cog-ram-works-with-any-hub-slot-method

Following on from this, I thought there might be a better way to manage cogs and hub.

Silicon space is fixed, but what if the cogs had more memory and the hub had less? We currently have a 512KB hub and 16 * 2KB dual-port cog RAMs.
If the cog space is significantly increased, the cogs can run at full speed (say at 200MHz they will effectively be 100MHz = 100 MIPS), disregarding any hub transfers. With much larger code/data blocks in cog, the only likely hub transfers would be cog-to-cog transfers via the hub. These can go at 800MB/s (32 bits every clock = 4B * 200MHz = 800MB/s) for each cog, in parallel!!! For large transfers between cogs, the transfers can be overlapped.
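
For the arithmetic behind that figure (a back-of-envelope sketch, assuming a 200MHz system clock and one 32-bit long transferred per clock per cog, as in Chip & Roy's new scheme):

    per cog:            4 bytes/clock * 200MHz = 800 MB/s
    all 16 in parallel: 16 * 800 MB/s = 12.8 GB/s aggregate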

So, why do you require a large hub space?
* Large programs (yes, you may need to split it up between cogs)
* Large data blocks (yes, you may need to split it up between cogs)
* Anything else ??? (I cannot think of any)

What would benefit from more cog space?
* Larger programs (GCC & PASM without LMM by using new hubexec in cog)
* Would 2+32KB cog space be ideal? (2KB matches the old P1 cog, and 32KB the old P1's total hub RAM, per cog)
* Bigger buffers for video
* Cog based stacks (GCC and its JMPRET)

I don't have any fixed agenda, but I am now wondering whether the huge hub offers enough benefit compared with what we would gain if all cogs had large memories.
Initially, my own reaction to this idea was 'no way'. But I think I have totally convinced myself that I would rather the cogs have more private memory.
Therefore, I am curious what mix might be desirable.
Here are the voting options. Please post your reasoning: select the option closest to what you would like, and give your reasons if you would prefer a different mix.
  1. 512KB HUB, 16 x 2KB COGS
  2. 384KB HUB, 16 x 2+8=10KB COGS
  3. 128KB HUB, 16 x 2+24=26KB COGS
  4. 32KB HUB, 16 x 2+32=34KB COGS (a little more space may permit 64KB hub)
  5. 32KB HUB, 1 x 2+240KB=242KB & 15 x 2+16=18KB COGS (1 asymmetric cog, 240KB might be 256KB)

Comments

  • jmg Posts: 15,173
    edited 2014-05-14 22:32
    I'll wait till Chip indicates the Memory area on the newest respin, but the key drawback of

    I would rather the cogs have more private memory.

    is that word private. It is going to be largely wasted across 16 COGs, and spreading code or data into COGs is not going to be a one-line change in a HLL, not to mention that it consumes COGs that could be doing real work rather than acting as MUXes.
    Cluso99 wrote: »
    What would benefit from more cog space?
    * Larger programs (GCC & PASM without LMM by using new hubexec in cog)
    * Would 2+32KB cog space be ideal (matches old P1 cog and total hub ram per cog)
    * Bigger buffers for video

    I'm not quite following 'Larger Programs', unless you just mean 'slightly larger': above 2K but below 32K.
    Much larger programs are actually worse off.

    32KB matches the old P1 cog?? But that was shared over 8 COGs, not one. If you want to "match that" (though I'm not sure why P1 RAM even matters), it comes in at 4KB per COG.

    * Bigger buffers for video - Err, where? Again you seem to mean bigger than 2K, but the larger HUB buffer space has been greatly reduced, so I see smaller buffers for video.
    The new HUB Rotate has higher bandwidth, so we need to see how that interacts with video first.

    Even 512K is light for LCD Driving apps. 768k is better.

    If Chip comes back with 768k plus some spare, then it makes sense to shovel some into COGS.
  • Cluso99 Posts: 18,069
    edited 2014-05-14 22:42
    I thought long and hard about my vote. I voted

    128KB HUB, 16x 2+24=26KB COGS

    but I would rather have

    256KB HUB, 16 x 2+16=18KB COGS

    and only voted as I did because that mix was not among the options.
    Why? Because it gives the best mix for what I see. Remember, we were only going to have 128KB hub and 8x 2KB cogs until last November.
  • Cluso99 Posts: 18,069
    edited 2014-05-14 23:03
    jmg wrote: »
    I'll wait till Chip indicates the Memory area on the newest respin, but the key drawback of

    I would rather the cogs have more private memory.

    is that word private. It is going to be largely wasted across 16 COGs, and spreading code or data into COGs is not going to be a one-line change in a HLL, not to mention that it consumes COGs that could be doing real work rather than acting as MUXes.
    RAM will be wasted wherever it is if the app doesn't need it all. There have only been a couple of apps where I needed more hub RAM, and there I added an external 512KB SRAM, although I only required the larger part once.
    Cogs are not going to be muxes. Everything comes out of a cog into a register, then to hub, then to another cog's register. That's how it works now, and how it would work for this. So it would be the same (or it might go via the msg pins).
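    (In PASM terms, a minimal cog-to-cog handoff via a hub mailbox looks something like this - just a sketch, all names illustrative:)

        ' sending cog:
                wrlong  value, mailbox          ' post a long at an agreed hub address
        ' receiving cog:
        poll    rdlong  value, mailbox wz       ' pull it back out of hub; Z=1 if still zero
        if_z    jmp     #poll                   ' spin until a non-zero value arrives
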
    I'm not quite following 'Larger Programs', unless you just mean 'slightly larger': above 2K but below 32K.
    Yes, below 32KB. But right now we are restricted to 2KB, plus sharing via a slow hub, and slower still with LMM.
    Much larger programs are actually worse off.
    This is true and stated in the previous paragraph. Code might need to go more parallel for very large programs. Currently we don't even have these very large programs (except one of mine using external SRAM, as far as I am aware).
    Remember, I am asking for reasoning. I don't have a fixed perspective, but I have convinced myself that some larger cog space is extremely useful. I just don't know how much.
    32KB matches the old P1 cog?? But that was shared over 8 COGs, not one. If you want to "match that" (though I'm not sure why P1 RAM even matters), it comes in at 4KB per COG.
    Even though it was shared, it was not necessarily divvied up equally. But I don't get your point.
    * Bigger buffers for video - Err, where? Again you seem to mean bigger than 2K, but the larger HUB buffer space has been greatly reduced, so I see smaller buffers for video.
    Actually, the cog can hold a much larger buffer that can be filled extremely fast. Those buffers can be updated much faster in their own cogs (like sprites etc.) and then, when needed, can be transferred (via hub) using an overlapped cog-hub-cog transfer, extremely fast at ~800MB/s (using Chip's new method).
    The new HUB Rotate has higher bandwidth, so we need to see how that interacts with video first.

    Even 512K is light for LCD Driving apps. 768k is better.
    If Chip comes back with 768k plus some spare, then it makes sense to shovel some into COGS.
    From Chip's older info, 768KB would have been tight, and we have just used another two cogs' worth of space for the new hub method. We might squeeze a little extra out for each cog, but nothing useful for hub (remember, hub needs to be in blocks of RAM).

    So, what I am after...
    Is there a better mix of hub and cog ram other than 512KB Hub & 16x 2KB Cogs ?
  • potatohead Posts: 10,261
    edited 2014-05-14 23:16
    Is there a better mix of hub and cog ram other than 512KB Hub & 16x 2KB Cogs ?

    No. Seriously.

    Maybe on the next design where it makes sense to reconsider the COG, but not this one.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2014-05-14 23:34
    My answer: whatever mix gets it out the door yesterday. Other than that, I do not care, and neither should anyone else. If you can't trust Parallax with this simple matter, maybe you should be looking at Arduinos.

    -Phil
  • RossH Posts: 5,462
    edited 2014-05-15 00:00
    I have this great idea!

    We could have all these options if we implemented a table-based hub/cog memory sizing scheme! Why, if we had a primary and a secondary table, then each cog could have not only a different RAM size but also a different word size! We could mix and match 8-bit cogs, 16-bit cogs, 32-bit cogs and 64-bit cogs!

    Think of the flexibility! And by my calculations it would only take 315,000 additional LEs and 470 additional instructions ...

    (Sorry Cluso ... couldn't resist! :smile:)

    Ross
  • Heater. Posts: 21,230
    edited 2014-05-15 02:01
    COG register space is, so far, fixed by the instruction set encoding format, with its 9-bit source and destination fields.
    It's simple, elegant and beautiful.
    Trying to extend that as proposed seems like an ugly hack.
    It would only result in half the possible RAM space sitting unused in COGs.
    Why does a CPU need more than 512 registers anyway?

    When we get to the 64 bit Prop III we can extend those src/dst fields and have huge amounts of COG space:)
  • Cluso99 Posts: 18,069
    edited 2014-05-15 02:12
    Heater. wrote: »
    COG register space is, so far, fixed by the instruction set encoding format, with its 9-bit source and destination fields.
    It's simple, elegant and beautiful.
    Trying to extend that as proposed seems like an ugly hack.
    It would only result in half the possible RAM space sitting unused in COGs.
    Why does a CPU need more than 512 registers anyway?

    When we get to the 64 bit Prop III we can extend those src/dst fields and have huge amounts of COG space:)
    So is hubexec an ugly hack? And what do you call LMM, then?

    If you want absolute determinism, then additional RAM acting as pseudo hub RAM, but totally private and accessible on every clock, makes complete sense to me. Given the 9-bit address limitation, this is the next best thing. It's better than hub, with the single exception that it cannot be shared. That may or may not be the disadvantage that kills it.
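
    (For anyone following along, the arithmetic behind that 9-bit limitation:

        2^9 = 512 directly addressable longs * 4 bytes = 2KB of register space

    so anything beyond 2KB can never be named as an instruction operand; it can only be reached as code through the program counter, hubexec-style, or as data through indirect techniques.)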
  • RossH Posts: 5,462
    edited 2014-05-15 02:31
    Cluso99 wrote: »
    So is hubexec an ugly hack? And what do you call LMM, then?

    Actually, yes - I do think hubexec is an ugly hack. It actually sacrifices most of the power of the cogs for a fairly ordinary speed improvement over LMM. I mean, who needs a CPU with 512 registers?

    On the other hand, even though the LMM technique was discovered late, it still uses the cogs the way they were intended to be used. I tend to think of PASM as being a kind of microcode (look it up you Propeller-heads who are too young to know what microcode is!) - it allows you to configure and customize each cog to execute whatever you want. This is why the P1 turned out to be so good at emulation.

    Ross.
  • koehler Posts: 598
    edited 2014-05-15 02:52
    Cluso99 wrote: »
    So is hubexec an ugly hack? And what do you call LMM, then?

    If you want absolute determinism, then additional RAM acting as pseudo hub RAM, but totally private and accessible on every clock, makes complete sense to me. Given the 9-bit address limitation, this is the next best thing. It's better than hub, with the single exception that it cannot be shared. That may or may not be the disadvantage that kills it.

    Cluso99,

    Personally, I think the 18K Core + 256K Hub is the sweet spot, though I voted for 26K as the only option close enough.

    I'm not sure what sort of hacking would be required to make this technically possible; however, having cores with actual memory of a size that is from this decade seems like it would have, at minimum, the potential to increase interest in the Prop. Whether that is actually desired is another thing.

    Up until last week, a number of people were complaining that the Prop isn't a media device, doesn't need uber-B/W, and on and on.
    Now, it seems that because Chip/Roy have sort of revisited an idea posted earlier by someone else, it's pure genius and why yes, of course we NEED this massive B/W.

    This forum is still one of the nicest on the web; however, I'm convinced it's also some of the craziest :)

    I fully understand and agree that it would be pretty sweet to have 16 cores with 18K each, and allow them to NOT HAVE TO USE hub as actual program memory. The side benefit, or primary one depending upon your POV, of being able to hit 100 MIPS ON AVERAGE seems like just another no-brainer to me.

    Aside from that, it SEEMS to be a far simpler concept to understand and explain, and closer to what I originally THOUGHT the Prop was supposed to be when it first came out. As I expect most others who looked and left thought, too.

    Chip's suggested change, along with your idea, seems to finally make the Prop a fair representation of what most people expect from something called an 8/16 cog/core uC, without the odd 2K cog/core RAM and a giant shared RAM pool.

    And the fact that you need to use a kludge like LMM in the first place, which is nothing but a workaround to break through the limitations of the design, still seems to escape people.
    And the fact that people still use it proves that while the original design may have been for a certain use, it's insufficient, and people WANT to use it in a more 'normal' way.

    Although Ken says Chip iterates from complex to simple, I think your idea is not going to get much of a fair shake in discussion, and we're going to be stuck with something far more complex, with a number of *, *, * caveats.
  • Heater. Posts: 21,230
    edited 2014-05-15 02:54
    Yes, hubexec is a hack. Trying to bend the Propeller architecture into doing something it was not designed to do.
    Not sure how ugly it is internally or externally yet.

    LMM is normal program operation. It's just another VM, or emulation if you like.
  • RossH Posts: 5,462
    edited 2014-05-15 03:00
    Heater. wrote: »
    Yes, hubexec is a hack. Trying to bend the Propeller architecture into doing something it was not designed to do.
    Not sure how ugly it is internally or externally yet.

    LMM is normal program operation. It's just another VM, or emulation if you like.

    Damn! I'm agreeing with you again! And I was so sure I was right!

    Ross.
  • Heater. Posts: 21,230
    edited 2014-05-15 03:15
    koehler,
    "...having Cores with actual memory of a size that is from this decade...
    Ugh, I have never seen a processor with 512 * 32 bit registers before.


    But, following your train of reasoning, look at the XMOS chips: only 4K per "logical core" (as they like to call them). The Prop is doing OK by comparison.


    My point here is that you can't really compare the Propeller architecture to more run-of-the-mill devices so easily.
    ...Chip/Roy have sort of revisited an idea posted earlier by someone else...
    I'm curious. I don't recall any such similar idea popping up on the forum. Anyway, it is a bit genius, don't you think?
  • koehler Posts: 598
    edited 2014-05-15 03:15
    RossH wrote: »
    Damn! I'm agreeing with you again! And I was so sure I was right!

    Ross.

    Well I won't, how's that?

    The fact that you need to use a VM-type of thing on a uController is rather a sad state of affairs.

    Cluso's idea seems to actually eliminate the need for such a thing, unless you want it for truly large programs.
    By doing so, it also allows one's code to actually spend all of its cycles doing what it needs to be doing, without the overhead of LMM, correct?

    LMM is great because it helps alleviate the problem that many people apparently want to use the Prop in a way that it was not intended to be used. That being so, why would one OPT IN for LMM, if one had the option of not needing it for most use cases?
  • Heater. Posts: 21,230
    edited 2014-05-15 03:22
    koehler,
    The fact that you need to use a VM-type of thing on a uController is rather a sad state of affairs.
    I don't know. Parallax owes a lot of its success to the BASIC STAMP, a hugely popular device with a BASIC interpreter. Then there are things like the PICAXE. And today we have the Espruino and Tessel running JavaScript, or the MicroPython setup.

    I believe that approach is what led to the Propeller and its built-in Spin interpreter.
  • Cluso99 Posts: 18,069
    edited 2014-05-15 03:47
    Ross & heater,

    Actually LMM is the hack. It is definitely not a VM.

    However, hubexec operating in extended cog space is actually how native operation works on most CPUs. The code runs from memory (in this case the extended cog RAM) and the 2KB is 512 registers. The fact that the code can also run in cog register space is more of an artefact.

    Hub is actually the shared buffer memory that ought to be designed to transfer data between various cores.

    So, in reality, this concept of larger cog memory is more like cores operating in parallel, which is, after all, what the Prop is supposed to be???
  • Heater. Posts: 21,230
    edited 2014-05-15 04:33
    Cluso,
    Actually LMM is the hack. It is definitely not a VM.
    What do you normally call a piece of software that:


    1) Fetches some kind of operation code from somewhere.
    2) Performs the operation specified by that "opcode".
    3) Go to 1)


    Many such systems are called Virtual Machines. Think Java or C#. In the old days we might have called them interpreters, like the p-code machine or the Spin byte code interpreter.


    What does the LMM kernel software do:


    1) Fetches some kind of operation code from somewhere.
    2) Performs the operation specified by the "opcode".
    3) Go to 1)


    Looks like a VM to me. The Propeller does not naturally do all that without being loaded with the LMM virtual machine software first.
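
    For reference, a minimal sketch of that loop in PASM, after Bill Henning's classic LMM kernel (labels and the pc register are illustrative):

        lmm_loop  rdlong  lmm_instr, pc     ' 1) fetch the next native opcode from hub
                  add     pc, #4            '    advance the hub program counter one long
        lmm_instr nop                       ' 2) the fetched instruction lands here and executes
                  jmp     #lmm_loop         ' 3) go to 1)
        pc        long    0                 ' hub address of the LMM code stream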


    Whether that is a "hack" or not I'll leave for others to decide.
  • potatohead Posts: 10,261
    edited 2014-05-15 07:55
    It absolutely is a VM. No question.

    I've been thinking for a while now that LMM has more merits than we think. Had we done the work to make assembler support for it, and standardized on a kernel or two, LMM might appear nearly transparent: just an extension of PASM, with some cool user-definable opcodes.
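
    For instance (a sketch only, building on the kernel loop a few posts up; FJMP and the operand layout are made up for illustration), a 'far jump' pseudo-op is just a native jump into the kernel, with its operand inline in the LMM stream:

        ' in the LMM code stream:
        '         jmp     #FJMP             ' invoke the pseudo-op
        '         long    target_address    ' its operand: the new hub PC
        ' in the kernel:
        FJMP      rdlong  pc, pc            ' pc already points past the jmp, at the operand
                  jmp     #lmm_loop         ' resume fetching at the target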
  • mark Posts: 252
    edited 2014-05-15 13:00
    RossH wrote: »
    I have this great idea!

    We could have all these options if we implemented a table-based hub/cog memory sizing scheme!

    It *is* a great idea! :lol:
  • koehler Posts: 598
    edited 2014-05-15 13:28
    potatohead wrote: »
    It absolutely is a VM. No question.

    I've been thinking for a while now that LMM has more merits than we think. Had we done the work to make assembler support for it, and standardized on a kernel or two, LMM might appear nearly transparent: just an extension of PASM, with some cool user-definable opcodes.

    Had the Prop had actual fast memory of any sufficient size, there would have been little need/desire for LMM, except for those rare times when you need more than nK of 'local' RAM.
    It wasn't doable at the time, at least not in the way Chip/Parallax wanted to do it, or for what they were willing to spend to make it happen.
    So, for years the 'workaround' has been to use LMM, with all the benefits and downsides it offers.

    With HubExec possibly being included, we're now supposed to see the Core/Cog RAM as 'registers', and 500+ of them!!!
    Talk about twisting of reality to make a shortcoming look like some sort of advantage...lipstick on a pig.

    Here is a neat idea which would give most of the objects more than enough memory to not have to use hub as a surrogate for local RAM, be more efficient and productive overall for the majority of use cases, be more true to the actual concept of multi-core, and it's going to be ignored... for the small subset who want video, which didn't even make Ken's Top 5 Real Customers Wish List.

  • potatohead Posts: 10,261
    edited 2014-05-15 14:14
    Who says it's about video?

    I'm fine with the current access scheme, and having a nice, roomy HUB means a lot of flexibility in terms of how it gets used too.

    BTW, the implications you raise regarding "multi-core" are precisely why I balked at the term years ago. The Propeller is a symmetric, concurrent, multiprocessor. That's different from a multi-core device that may happen to have some kind of interconnects or other.

    This Propeller is a symmetric, concurrent, multiprocessor, and it has a very significantly improved throughput over the first generation design!

    One common shared memory area is excellent! Can't wait for the FPGA.

    Truth is, the original design vision of the Propeller 1 had very little to do with how it's often being used today. That's how sweet the design was! The other truth is the long cycle on this chip has meant pushing the Smile out of the current one, wildly exceeding any meaningful expectations in play when it was completed.

    Really, COG native PASM is most like microcode. More than a few of us made this realization some time back. LMM in that context, particularly when it's possible to run LMM at very high speeds now with this new design, means being able to run large programs and actually optimize the COG performance to match the program needs. Since we appear to be going further, hubexec means those same advantages, with a nice array of register memory too!

    It is common on Propellers to consider program type, size, data, needs and use the various parts of the chip, often in parallel, to realize the goal. A whole lot of the discussion seems to involve reaching a state where there is fast, jitter free hubex code, when the truth is, we don't need that. We need fast hubex code, which we are going to get, and we are going to get lots of cores all about doing that at the same time too! For those things where it's really going to matter, there is COG code, and that's just how the design is intended.

    Seems to me this would all be less painful if we took a look at the design intent and put it to use in an effective way, rather than investing pages and pages on how it really should be some other design intent altogether!

    Lots of us are able to use the P1 just fine, and it's a great chip! This one will be no different.
  • koehler Posts: 598
    edited 2014-05-15 17:24
    potatohead wrote: »
    Who says it's about video?

    That has basically been the most oft-talked-about wish from several people.
    From the large users, it hasn't been in the Top 5.
    More bandwidth is almost always good, except where an implementation of it causes other issues.
    I'm fine with the current access scheme, and having a nice, roomy HUB means a lot of flexibility in terms of how it gets used too.

    BTW, the implications you raise regarding "multi-core" are precisely why I balked at the term years ago. The Propeller is a symmetric, concurrent, multiprocessor. That's different from a multi-core device that may happen to have some kind of interconnects or other.

    That can be argued either way, as multiprocessor is just a slightly vaguer term than the strict multicore.
    The P1 can be argued to be a multicore uC, just with a really small local RAM.
    Per wiki: "A multi-core processor implements multiprocessing in a single physical package. Designers may couple cores in a multi-core device tightly or loosely."
    This Propeller is a symmetric, concurrent, multiprocessor, and it has a very significantly improved throughput over the first generation design!

    So did the last 3(?) designs. Much like the original kitchen-sink P2, though, this version now seems to trade away simplicity/determinism for large B/W potential in a select number of corner cases.
    One common shared memory area is excellent! Can't wait for the FPGA.

    I fail to see how the benefits of a Large, Shared Hub memory pool for some select cases is better than the discussed benefits this idea would bring to the majority of Users/objects.

    Truth is, the original design vision of the Propeller 1 had very little to do with how it's often being used today. That's how sweet the design was! The other truth is the long cycle on this chip has meant pushing the Smile out of the current one, wildly exceeding any meaningful expectations in play when it was completed.

    Uhh, I just have to be blunt here.

    What that actually proves is that some of the design decisions were totally wrong, and the only reason the Prop is still in production, unlike the Javelin, is that a workaround was found to avoid those pitfalls.

    I can appreciate the love Parallax gets for being as open and friendly as they are, however trying to revise history to paper over failings is not going to help them one bit.

    As for the long dev cycle, Parallax owns that entirely, both through the design decisions made then and, primarily, the lackluster adoption outside of their core markets.
    One can fairly and honestly say that if they had actually produced an IC with 8 cores, 10-16K of local RAM each, and a hub, it would have met with at least the same reception, and quite probably significantly greater uptake and revenue.
    That would have allowed them to put some real resources into further sustained incremental development.

    I know the Prop was most likely originally intended to pick up some slack from the BASIC Stamp, or so it was said.
    In that case, the design decisions were perfectly adequate for that and for hobbyist use.
    However, IIRC, Parallax has tried to field this as a more business-oriented product, and it has failed in the larger market.

    I fail to see any business reason for this P2 to markedly change that, even with a potential 800MB/s of B/W in certain use cases. I actually see a lot more potential for Parallax to recoup past R&D and increase future revenue by simply listening to what the market wants after 8 years.
    It's their money and they can obviously do as they like, of course.

    Really, COG native PASM is most like microcode. More than a few of us made this realization some time back. LMM in that context, particularly when it's possible to run LMM at very high speeds now with this new design, means being able to run large programs and actually optimize the COG performance to match the program needs. Since we appear to be going further, hubexec means those same advantages, with a nice array of register memory too!

    Can't argue that, because I don't know for sure, and frankly there seem to be quite a few people questioning that claim.
    What I do think I know is that for most use cases in the OBEX, that is not what it's primarily used for, nor has it been asked for by real revenue customers.
    Cluso's idea seems by far to be the best of both worlds, wherein the average developer would see 16 actual functional cores with memory exactly like what they are used to and most comfortable with, while also giving them the option of using LMM for the very occasional 100K+ program/RAM requirement.

    One would have to look at the OBEX to try and get an average object size, etc.
    It is common on Propellers to consider program type, size, data, needs and use the various parts of the chip, often in parallel, to realize the goal. A whole lot of the discussion seems to involve reaching a state where there is fast, jitter free hubex code, when the truth is, we don't need that.

    Jitter affects determinism. IIRC, it is exactly THIS that both you and Heater have complained about non-stop in every thread on sharing, right?
    Now for some reason it's the opposite?
    We need fast hubex code, which we are going to get, and we are going to get lots of cores all about doing that at the same time too! For those things where it's really going to matter, there is COG code, and that's just how the design is intended.

    At this point, I am not sure hubexec is still on the table or not.
    Seems to me this would all be less painful if we took a look at the design intent and put it to use in an effective way, rather than investing pages and pages on how it really should be some other design intent altogether!

    From my POV, quite a few people have looked at it and feel that it is an upending of some of the foundations of what the Prop is supposed to be. I am assuming they have looked at the design and find that the trade-offs, made simply for B/W, are not worth the other issues.
    Lots of us are able to use the P1 just fine, and it's a great chip! This one will be no different.

    I appreciate that you have unbounded enthusiasm for, and belief in, Parallax.
    All of this was said before, for the original P2, the Uber-P2, the P1b, and now this one.
    This appears to be a great chip for some, and not so great for others.
    What's really going to matter is how great this is for Parallax's large customers, and whether it makes any dent in past investment and future profit.
  • potatohead Posts: 10,261
    edited 2014-05-15 17:54
    I have to laugh.

    The P1 had no material failings. It is awesome, period.

    Personally, I love how Chip does things. This HUB is different, but I'm gonna write some code before I rule it out.

    I never made one complaint about jitter. What I did and still do balk at is the idea that the COGS would perform differently once imbalanced.

    Hubex code is for big programs, and those just need to run quickly. LMM can serve the same purposes and, frankly, can have some real advantages, given a little assembler support and planning.

    For those things that need to be spot on, COG code does that nicely.

    Chip does things as he does them. I am a fan, and to be frank in return, I wrote code on everything he has created and found it powerful and less difficult than I thought.

    So I'm going to do that again with this design. If I find I can't get it done, or it is crappy, I'll get after it then. Just like the other times.

    I have some concerns about this design, but I've also read compelling approaches to it as well. When we get an FPGA to jam on, this kind of discussion will likely be more productive.

    To be perfectly honest, there are a few of us here rather consistently advocating for an approach that isn't typical of what Chip does. I'm not going to support that because I have serious respect for how he has approached computing over the years, starting from 8 bit machines long ago.

    If you really wanted to, you could go and read our words here at the time of P1 being released. The how, why, etc... are all there, as was I and others here. There are some basic ideas in play that matter, and they have been expressed over and over here. I believe in them, and when I apply them, I get stuff done too.

    Which is why I'm going to jam on the FPGA and give this thing a go, not near constantly attempt to bias things away from the basic design ideas.
  • potatohead Posts: 10,261
    edited 2014-05-15 17:57
    BTW, notice that poll being 2 : 1 in favor of what I just wrote earlier?
  • Heater. Posts: 21,230
    edited 2014-05-15 20:52
    koehler,
    What that actually proves is that some of the design decisions were totally wrong...
    Perhaps. Do remember, though, that there would have been no point in Parallax producing a chip where all the design decisions were "right" according to your criteria. Creating the equivalent of yet another PIC, AVR, ARM or MIPS would have been pointless. The world is already full of those. There is no way to make sales into that market. And by the time you had made that kind of design as efficient as the other offerings, you would find you can't put 8 cores on a chip, in the same way that they cannot. Then you need interrupts to support multiple events and such.

    Boom...you have destroyed the original concept of the Propeller.

    Seems those design decisions were right after all.
    ...and the only reason the Prop is still in production unlike the Javelin, is because a workaround was found to avoid those pitfalls.
    I think this is demonstrably not true. I'm fairly sure that the number of C users is very small compared to the number of Spin users. That is to say the number of LMM users is very small. LMM is not supporting the Propeller.
    Jitter affects determinism. IIRC, it is exactly THIS that both you and Heater have complained about non-stop in every thread on sharing, right?
    Now for some reason it's the opposite?
    You misunderstand what I have been saying about "determinism". Most of my discussion is about having all COGs the same. It's about code executing on one COG not being able to modulate the speed of code running on other COGs. This gives the wonderful property that any code can be mixed with any other code and it will always work. This ensures you don't get any weird random failures due to timing at run time. That is determinism.

    This is often confused with the lower-level timing determinism. Given that we have to create hardware interfaces by bit-banging on pins, we had better be able to time that very accurately. That demands that all the instructions we use to do that execute in the same number of clocks. (Or it can be done the XMOS way with clocked I/O; I think many here would say that is more complex for the programmer, though. It certainly messes up the "all pins are equal" idea.)

    Now, we have bit-banging level determinism in the PII. We have code reuse level determinism in the PII. What we might not have is HUB read/write timing determinism.

    It's not clear to me that we need HUB read/write timing determinism. As long as data can be moved in and out of HUB fast enough to meet throughput requirements. As long as we can bang on those pins with nano-second precision. Who cares if the HUB access is jittery. No one can see it!
  • Heater. Posts: 21,230
    edited 2014-05-15 21:10
    koehler,
    ...we're now supposed to see the Core/Cog RAM as 'registers', and 500+ of them!!!

    I don't know about "now". I have always seen the Propeller like that. Even before there was an LMM. Bill Henning had not invented it yet!
        mov  R0, #3    ' load the literal 3 into "register" R0
        mov  R1, #5    ' load the literal 5 into "register" R1
        add  R0, R1    ' R0 := R0 + R1
    

    Hmm...looks like registers to me. Perhaps I have that view because my first ever Propeller project was a Z80 emulator.

    What? Those instructions are actually also in the registers from where they are executed. Wow, yeah, that is a bit freaky weird. What idiot would design a processor like that?
    Talk about twisting of reality to make a shortcoming look like some sort of advantage...lipstick on a pig.
    But guess what? It means you can run jitter-free bit-banging code without having to fetch instructions from a RAM shared with 7 other processors. Wow, that's cool: much, much faster!

    Hey, that "shortcoming" is looking like a stroke of genius!

    Remember that at the P1 design time there were not enough transistors available to give each COG its own RAM even if you wanted to.

    Should the PII use its extra transistor budget to stuff RAM into COGs? Hardly seems worth it to me.
  • Cluso99 Posts: 18,069
    edited 2014-05-15 21:32
    Heater. wrote: »
    I think this is demonstrably not true. I'm fairly sure that the number of C users is very small compared to the number of Spin users. That is to say the number of LMM users is very small. LMM is not supporting the Propeller.
    Most likely very true.
    But it is on the list of most required features for P2, as posted by Ken !!!
    You misunderstand what I have been saying about "determinism". Most of my discussion is about having all COGs the same. It's about code executing on one COG not being able to modulate the speed of code running on other COGs. This gives the wonderful property that any code can be mixed with any other code and it will always work. This ensures you don't get any weird random failures due to timing at run time. That is determinism.
    Well, the new hub method certainly has the potential to break this 'run in any cog' idea.
    This is often confused with the lower-level timing determinism. Given that we have to create hardware interfaces by bit-banging on pins, we had better be able to time that very accurately. That demands that all the instructions we use to do that execute in the same number of clocks. (Or it can be done the XMOS way with clocked I/O; I think many here would say that is more complex for the programmer, though. It certainly messes up the "all pins are equal" idea.)

    Now, we have bit-banging level determinism in the PII. We have code reuse level determinism in the PII. What we might not have is HUB read/write timing determinism.
    But Heater, that has been your argument all along: every cog must be identical because it may impact HUB read/write determinism.
    Now you are taking the opposite view because Chip has proposed it.
    It's not clear to me that we need HUB read/write timing determinism. As long as data can be moved in and out of HUB fast enough to meet throughput requirements. As long as we can bang on those pins with nano-second precision. Who cares if the HUB access is jittery. No one can see it!
    What!
    This has been the very essence of all your arguments against any slot-sharing ideas.

    I love what has been proposed.
    But, in all honesty, any cogs relying on fixed hub slots may break if they change cogs and/or hub memory addresses. And this potentially includes hubexec from hub.

    And you are against any method that solves these problems (putting more memory into the cog to be used as "normal" code and data memory, not registers).
  • potatohead Posts: 10,261
    edited 2014-05-15 21:47
    You know what's interesting?

    With the slot sharing, it was possible to create configurations that would break COG code. Code written for one scheme, would perform poorly, or differently on another scheme. Lots of COGS running lots of different code and schemes breaks down to lots of special "works for me" cases. Sharing gets harder overall, whether or not the author intended for that to be the case.

    Now, with this scheme, "the mix master" scheme as I'm calling it myself, I think it's possible to create code that could depend on a specific COG, but in general, code written for COGS, works on COGS.

    The problem is inverted. If somebody really wants to, they can lock code into a specific COGID and addressing scheme. If that code is moved, it could run differently, but only that code is broken. Everything else works on any COG.

    Most cases, in fact a lot of them, will be "works everywhere" cases, and somebody creating a "works for me, or this time" case, gets locked into whatever COG / addresses they wrote for. Sharing then is easy, except for when it's not, and that's the opposite of the slot sharing scheme. Sharing gets harder when the author intends it for very specific combinations of COG / address, otherwise it's easy.
  • Cluso99 Posts: 18,069
    edited 2014-05-15 21:59
    potatohead wrote: »
    You know what's interesting?

    With the slot sharing, it was possible to create configurations that would break COG code. Code written for one scheme, would perform poorly, or differently on another scheme. Lots of COGS running lots of different code and schemes breaks down to lots of special "works for me" cases. Sharing gets harder overall, whether or not the author intended for that to be the case.

    Now, with this scheme, "the mix master" scheme as I'm calling it myself, I think it's possible to create code that could depend on a specific COG, but in general, code written for COGS, works on COGS.

    The problem is inverted. If somebody really wants to, they can lock code into a specific COGID and addressing scheme. If that code is moved, it's broken, but only that code is broken. Everything else works on any COG.

    Most cases, in fact a lot of them, will be "works everywhere" cases, and somebody creating a "works for me" case, gets locked into whatever COG / addresses they wrote for. Sharing then is easy, except for when it's not, and that's the opposite of the slot sharing scheme. Sharing gets harder when the author intends it for very specific combinations of COG / address, otherwise it's easy.
    Actually, you have this precisely the wrong way around !!!
    Code written for one cog (if tightly coupled as was done on P1) is more likely to fail under this scheme if it is placed into a different cog. It is also likely to fail if the data (and/or possibly hubex code) is moved in hub from where it was originally tested.
    So there you have it. Go do the math on the slots/cogs/addresses. They vary the timing.

    BTW, I am still for this method because it's better than none, and it can be worked around, just as with the many other schemes.
  • Heater. Posts: 21,230
    edited 2014-05-15 22:13
    Cluso,
    ...the new hub method certainly has the potential to break this 'run in any cog' idea.
    Please explain how because I don't see it.

    What I see in the new scheme is that any HUB access may have to wait from zero to fifteen of whatever amount of time it takes the HUB mechanism to move around a "notch".

    I see that the number of those "notches" it has to wait for is dependent on the lower bits of the address. If I'm writing in a HLL and don't know or care about the addresses of anything then effectively my HUB access time has been randomized.
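
    As a rough model (assuming 16 RAM slices selected by the low 4 bits of the long address, as described in Chip's thread):

        slice = (address >> 2) & %1111      ' which of the 16 slices holds this long
        wait  = (slice - rotor) & %1111     ' 0..15 clocks until the rotor reaches it

    So the worst case is 15 clocks and, for effectively random addresses, the average is 7.5.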

    I see that this scheme dramatically increases over all HUB bandwidth and hence COG performance.

    I see that all COGs get the same view of all this all the time.

    So how does it break the "run in any cog" idea?
    But Heater, that has been your argument all along: every cog must be identical because it may impact HUB read/write determinism.
    No, it has not been my argument all along. How could you read all that I wrote in that post, and millions of other posts, and make that conclusion?

    The "all COGs equal" idea is all about not having a COG be able to modulate the speed of any other COG. I have often expressed it exactly that way. It is all about decoupling.
    Now you are taking the opposite view because Chip has proposed it.
    No, I'm not taking the opposite view. The new HUB sharing scheme makes all cogs equal. Why would I not like that aspect?

    There may be other reasons I don't like it. I have not had time to think about it much yet.
    This has been the very essence of all your arguments against any slot-sharing ideas.
    No it has not, see above.
    But, in all honesty, any cogs relying on fixed hub slots may break if they change cogs and/or hub memory addresses.
    So don't do that! :)

    Really, it's pretty crappy practice to have your code fail because it has been compiled to a different RAM address or run up in a different COG. Blech!

    However, you raise an interesting point. It does seem to make it easier for people to carefully craft code that works fine given that it is the right COG and uses the right HUB memory addresses but breaks otherwise. In possibly subtle and unpredictable ways.

    Hmm... thinking on... I guess it also means I could be careless and still manage to write code that works fine in my project, in the COG it is assigned to, at the addresses the compiler happens to use, but fails in subtle, hard-to-debug ways if any of that changes!

    How would I know if I had accidentally done that? How much head scratching is that going to cause?

    This does sound bad. Is it actually bad in practice?
    And you are against ... putting more memory into the cog
    Actually, I'm a bit relaxed about that. My feeling is that any way to do that is a bit "hacky". I'm really not keen on the idea of having less HUB RAM; it has to be taken from there, right? But I could live with it if we can keep 256K or more in HUB.