The case for Additional/Extended COG RAM (+2/4/6/8KB)

Cluso99 · 2014-05-19 01:56

RossH wrote: »

I'm not against HubExec - I just don't regard it as the main game for programming the Propeller. To get the most benefit from its unique architecture, the "normal" way to program a Propeller will remain the parallel execution of cog-based PASM objects.

My objection is purely related to the complexity it would add. As well as four different types of RAM on the Propeller (five if you include XMM), each with different access instructions, we will also now have three "native" execution modes:
Execution from Cog RAM

Execution from Hub RAM

Execution from Extended Cog RAM (which seems to be a hybrid of the above two).

Plus, we will still have LMM mode, since (I believe) none of the above can address more than the 512K of Hub RAM and we already run programs larger than that on the P1 (and we will want to do even more on the P2). And we will probably still have CMM mode, just as the ARM has "thumb" mode, for code compactness where speed is not critical.

So, now we have at least three native modes plus two additional special purpose execution modes.

In total, five RAM types and five execution modes. Is this really necessary?

Ross.

There will not be all these modes, because the new default mode will be "normal mode" which will be a single mode representing the flat memory model and is a combination of

Execution from Cog RAM (executes from registers)
Execution from Extended Cog RAM (executes from the new cog ram)
Execution from Hub RAM

Because these modes share a flat memory model, the differences are only:

Execution from addresses <$200 = Execution from registers
- this is the only code that supports self-modifying code
- compilers will fail if the D or S registers are >=$200 (as they do now)
Execution from addresses $200-$DFF = Execution from new Cog Ram
- $DFF will depend on the amount of new cog ram implemented
- self-modifying code not supported
- compilers will fail if the D or S registers are >=$200 (as they do now)
Execution from addresses $E00-$7FFFF = Execution from shared hub ram
- self-modifying code not supported
- compilers will fail if the D or S registers are >=$200 (as they do now)
- slower execution due to hub slot mechanism
- possible jitter, latency & determinism problems (user to resolve)
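
The address ranges listed above can be sketched as a small lookup. This is only an illustration of the proposed flat map, not anything Parallax has specified: the function names are made up here, and the $E00 boundary is an assumption that, as noted above, depends on how much new cog RAM is implemented.

```python
# Sketch of the proposed flat memory map (boundaries taken from the post above).
REG_END = 0x200       # addresses below this execute from cog registers
EXT_COG_END = 0xE00   # ASSUMED top of extended cog RAM ($DFF + 1); implementation dependent
HUB_END = 0x80000     # $7FFFF + 1

def exec_region(addr, ext_cog_end=EXT_COG_END):
    """Return which memory an instruction at `addr` would execute from."""
    if not 0 <= addr < HUB_END:
        raise ValueError("address outside the flat map")
    if addr < REG_END:
        return "registers"
    if addr < ext_cog_end:
        return "extended cog RAM"
    return "hub RAM"

def self_modifying_ok(addr):
    """Only code running from registers may modify itself."""
    return exec_region(addr) == "registers"

def operands_ok(d, s):
    """D and S must still name registers (< $200) in every region."""
    return 0 <= d < REG_END and 0 <= s < REG_END

print(exec_region(0x1FF), exec_region(0x200), exec_region(0xE00))
```

The point of the sketch is that only the execution address picks the region; the D and S check is the same everywhere.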

However,

CMM mode is likely to gain wider use because of the larger cog program space, and the faster and deterministic code this produces
XMM mode will be required for programs using external memory (as for now)
LMM mode if hubexec doesn't work properly ??? (as for now)

So, I only see benefits, not perceived problems which in reality do not exist.

Cluso99 · 2014-05-19 02:12

Even if the new extended cog ram is not implemented, everything mentioned and required in this thread will still be required to be done if hubexec is to be implemented from hub ram.

RossH · 2014-05-19 02:16

Cluso99 wrote: »

There will not be all these modes, because the new default mode will be "normal mode" which will be a single mode representing the flat memory model and is a combination of

...

So, I only see benefits, not perceived problems which in reality do not exist.

Hmmm ... you have just agreed with everything I said - in fact, you have added yet another programming mode - but you don't think that complicating the programmer's job to this extent is any kind of problem. So I may as well stop trying - if you can convince Chip (he is also a hardware engineer, so you probably stand a better chance) then good luck to you!

Ross.

RossH · 2014-05-19 02:31

Cluso99 wrote: »

Even if the new extended cog ram is not implemented, everything mentioned and required in this thread will still be required to be done if hubexec is to be implemented from hub ram.

Which is precisely why I am not a strong supporter of HubExec. The cost and complexity of it (compared to the alternatives) does not seem to make it all that worthwhile.

However, Chip still seems keen, so we may get it (i.e. unless he discovers it is too expensive).

Ross.

potatohead · 2014-05-19 07:45

The current plan is to get the basic design running, then consider hubex code.

A few minutes at most.
Nothing else is required - nothing, nothing, nothing...

I strongly doubt that.

What I see here is some additional RAM, exclusive to the COG, which means bigger programs that avoid the HUB. How big? Well, as big as possible! Extend that some, and really, who needs a HUB, or much of one? Let's just make the COGS really big, and have 16 of them in there...

Kerry S · 2014-05-19 08:21

potatohead wrote: »

Let's just make the COGS really big, and have 16 of them in there...

Yes! With smaller shared HUB ram for data exchange. That would be a more marketable chip than what we are heading towards... Except for the people already using the Prop1 who are not going to significantly expand market share or even market volume (not including Ken's industrial customers who I would bet would prefer to have the extra industry standard use CORE memory).

Cluso is right here. This severe CORE memory limitation, on an otherwise great design, needs to be fixed RIGHT and not by yet another kludge just to keep some illusionary product metaphor (a propeller...).

What he is suggesting is the simplest and most new customer friendly approach to the issue. And you still get to keep your Whirling Cyclone of Data Propeller, only it is used for shared data and NOT program code. Now Ray's super block mode for data transfers really shines!

We all want the same thing guys... a GREAT chip that Chip is happy with that expands Parallax's sales so we get a P3 and P4 and ...

Brian Fairchild · 2014-05-19 08:40

potatohead wrote: »

...unless one decides the Propeller concept is a failure.

Depends on what you believe the 'propeller' concept is.

koehler · 2014-05-19 10:24

Brian Fairchild wrote: »

Depends on what you believe the 'propeller' concept is.

Somewhere along the line, Prop has gone from a neat h/w idea inspired by a general concept, to implementation, to a rigid, dogmatic orthodoxy.

Except, when it isn't, like Chip's recent change that throws a bit of a spanner into determinism...... sort of like God revoking one of the 10 Commandments (Or Buddha, or FSM).

Very few people here give 2 splits about whether Parallax attracts any new revenue generating customers to the product.
All that matters is getting new shiny ASAP for personal use, and magic will happen and Parallax will always have money for the next shuttle run for a P3.

This is something that could probably be made a voluntary switch upon Prop power-up, so that's that.
If this feature could have somehow been added before, it's quite possible Parallax Semi might have been an actual success.

This is such a no-brainer in the 'make things a little more approachable to new customers' path.
Not including it (assuming it is so simple), simply means Parallax has no real need for more revenue, or this has truly become a fanaticism worth xxxx $$$.

I'm sure extra revenue coming into Parallax could make for some nice bonuses, pay raises, new and improved mfg. equipment, new computers, etc, etc.

But no, it breaks my fantasy of how the Prop works, and that's more important....

Very sad.

Heater. · 2014-05-19 10:53

Way to go koehler.

You tell Chip how he should design his chip. Never mind anyone else's opinions. They are so obviously wrong. Poor misguided fools.

However. Last time I checked it was Chip's project done on his own time and paid for with his own money.

We, humble groupies, can only express our dreams, desires and concerns and, better still, suggest solutions to problems.

And that includes you.

At the end of the day it's Chip's call.

potatohead · 2014-05-19 11:06

Hi Koehler.

Hugs 'n kisses. PH

Chip mentioned smoothing out some of the HUB access times. I think learning to make better use of the HUB and or improving on how it can be used is the best scenario, compared to enlarging non-shared COG memory.

Simple as that.

Cluso99 · 2014-05-19 12:02

Guys, this is really a no brainer.

Imagine if the P1 had an extra 2KB of cog ram, and the JMPRET (JMP/CALL/RET) S operand was Relative. The only other requirement is the PC be 10 bits instead of 9 bits.
Of course there probably was insufficient silicon.

heater - ZiCog would not have required overlays!

How many other programs/drivers could have been written with double the cog ram but only the same register space?

I bet a lot of video drivers could have made excellent use of it.
Expanding spin would have been easy, as would have been my faster spin (25+%).

LMM would have been unnecessary for numerous programs, but not all.

As I understand it, this was the biggest request by customers.

But we were all blindsided by the false requirement that S & D needed to be increased to 10 bits, when in reality I have just come to the conclusion that this is nonsense. We just need the PC increased, a Relative JMPRET, and more cog ram.

All fully deterministic, no latency, no jitter, no special hub slot scheme, all at full cog speed.

Why didn't I think of that? Wait a minute... I did!
That is what this thread is about!!!

Works with any hub slot scheme, works with LMM, works with XMM, works with CMM, works with GCC and Catalina, works with SPIN.

And HUBEXEC from hub will work if the PC is increased, and JMPRET is changed to support a larger S (source) operand, and a flat memory model for combining cog ram and hub.
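
A rough sketch of how a relative JMPRET could reach the extra cog RAM with an unchanged 9-bit S field and a 10-bit PC. The thread does not say how the offset would be encoded, so the signed two's-complement offset, the wrap-around at the top of the PC range, and the function names below are all assumptions made for illustration.

```python
PC_BITS = 10   # 1024 longs: the existing 2KB of cog RAM plus the extra 2KB
S_BITS = 9     # S field width stays as it is on the P1

def sign_extend(value, bits):
    """Interpret the low `bits` of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign

def relative_target(pc, s_field):
    """ASSUMED encoding: target = pc + signed 9-bit offset, wrapped to 10 bits."""
    return (pc + sign_extend(s_field, S_BITS)) & ((1 << PC_BITS) - 1)

print(hex(relative_target(0x200, 0x010)))   # forward jump inside the new cog RAM
print(hex(relative_target(0x200, 0x1FF)))   # offset -1, back into register space
```

Under these assumptions a 9-bit offset reaches 256 longs backwards or 255 forwards from the current PC, so D and S never need to grow past 9 bits.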

Heater. · 2014-05-19 12:19

Cluso,

heater - ZiCog would not have required overlays!

Perhaps true.

However it's not a good point of argument here.

ZiCog did not suffer from use of overlays. The entire 8080 base instruction set could be emulated in COG if I remember correctly. Well, apart from the pesky DAA instruction that was never used in CP/M code much anyway. The rest of the 500 odd instructions were done with overlays. Slow but who cared? They were never used in CP/M code. (Apart from the memory block instructions occasionally)

No, ZiCog was crippled by the need to keep the 64K of Z80 RAM in external memory. A 128K of HUB RAM on the P1 would have made ZiCog fly at running CP/M a lot faster than any COG space expansion.

pjv · 2014-05-19 14:18

Cluso99 wrote: »

Imagine if the P1 had an extra 2KB of cog ram, and the JMPRET (JMP/CALL/RET) S operand was Relative....

Cluso99,

That would have been phenomenal.

As you know, I use a scheduler to simultaneously run multiple threads in a single cog in the P1. Not having either an indirect or a relative addressing scheme causes that code to be perhaps 25% larger (I have not done an exact analysis), and that eats into the cog RAM remaining for application code. It also makes the scheduler slower to run, and speed is king here as task switches want to be as fast as possible. Also, those application codes would themselves have benefitted from that addressing scheme, making them perhaps 10% to 20% smaller. From day one I have lamented the lack of advanced addressing!

Even so, I still love the prop for what it brings me.... simplicity and timing predictability (solid determinism).

And with your scheme the cogram is enlarged to boot, allowing more threads to co-exist in a single cog!

I LOVE the concept.

Cheers,

Peter (pjv)

koehler · 2014-05-19 14:31

Caution, long rant ahead.

Heater,

Ken said he welcomed comments, and criticism, even brutal IIRC.
My impression is that Ken went to the expense and effort to set-up Parallax Semi in the hopes that they could generate more revenue, not just because he had nothing else to do....

Rather than trying to put words in my mouth, why don't you admit that "The fools" were the ones who pushed Chip to have everything under the sun bolted onto the last P2hot, regardless of whether that's what the actual customer of Parallax wanted, needed, or may be willing to pay for. I'd argue they were not 'poor misguided fools', but people who are looking to have Parallax make a personal toy for them, regardless of whether it increases Parallax's revenue/volume.

Chip and Ken can pick and choose who they want to hear or listen to.
The last time around, it's clear Chip listened to all the 'experts' and "Ooh, I want this" crowd. And it turned out to be a blowtorch design.

There has been a genuine technical issue in the past which has precluded adding Core memory to the Prop.
With Chip's current proposal, and Cluso's suggestion, there is no longer any technical issue preventing Parallax from escaping its reputation as a 2K processor.
You and others may be happy to work around this issue with various fixes, and call engineers who don't look at the Prop short-sighted or something; however, the past 8 years have shown that precious few others are interested in such a time sink. Where would Parallax, Ken, Chip, and everyone else here be if they were able to just double the sales of the Prop for the next couple of years because more engineers find 16 Cores with actual memory much more approachable, even with these unique s/w peripherals?

"And that includes you."

Yes it does, and I have no problem arguing for this whatsoever.
I see comments from Ken and others that this or that can't be done, a webpage can't be updated because of resource issues, and so on.
This tells me that Parallax most likely can use some extra revenue, like anyone else.

I'm not looking at them taking on ARM, AVR, or anyone else.
I'd like to see them be more successful, enough that a Parallax-Semi might be able to be re-started, and revenue flow in sufficient to support future efforts.

"At the end of the day it's Chip's call."

Everyone is well aware of that. And Chip can be easily led off path as history has shown.

For some, I think this is a product more for navel-gazing self-interest than realizing it is supposed to help keep people employed and fund future work.

JRetSapDoog · 2014-05-19 14:41

Cluso99 wrote: »

What aren't you getting ??? Or are you really being argumentative for the sake of it - I really don't know ???

Cluso/Ray: How can they be so obtuse!! (trying to inject a little humor here)

Shawshank Redemption: https://www.youtube.com/watch?v=dakxwoVV7yM
Family Guy Version: https://www.youtube.com/watch?v=obFHu7DCsEs

I'm surprised at how utterly flawed this proposal seems to some.

Seems to me that evaluating it is far from a black-and-white endeavor.

If something like it gets implemented, people could easily change their tune.

jmg · 2014-05-19 15:00

potatohead wrote: »

Chip mentioned smoothing out some of the HUB access times.

Yes, Chip mentioned a form of FIFO for Video DMA style streaming, that allows fSys/N video, also with LUT option.
He mentioned 20 deep using registers+muxes, but I think this can pack into 16 deep Dual Port RAM + modest Logic.

That can feed HUB data in a linear-address manner, at any N of fSys/N, into a LMM or some form of hubexecute pipe.
That HW design means jumps are still going to have time-impacts on HubExec, and will favour Skip-style coding and opcodes.
(Note that QuadSPI Flash used as Execute In Place also has a high Jump cost & likewise favours Skip-style coding and opcodes)

I agree getting a full speed linear streaming solution working first, should be a higher priority, but 16D DualPort RAM is saving some area on the original proposal, so perhaps that frees COG Logic for the LUT to also allow Code Execute.

Invent-O-Doc · 2014-05-19 15:19

Although I have been opposed to hub sharing schemes, I support a larger COG ram space if it doesn't complicate the development of the new P2.

That said, there may be some other factors (die space / power consumption / addressability). Would people trade larger COGs for fewer of them? Perhaps 12?

As far as Cluso99's initial comments, I don't see people running 16 COGs in hubexec mode, maybe 1-3.

potatohead · 2014-05-19 15:39

If something like it gets implemented, people could easily change their tune.

Maybe!

Staying true to the basic design should be a priority. We got the P2 "hot" edition over ongoing stuff like this.

jmg · 2014-05-19 15:45

Invent-O-Doc wrote: »

That said, there may be some other factors (die space / power consumption / addressability). Would people trade larger COGs for fewer of them? Perhaps 12?

I'd agree die-scaling allows decisions on 16/15/14..12 COGs, but I suspect the new Rotate HUB Memory scheme is less flexible.
That maps the lower Address nibble to each COG, in sequence, which locks it to 16 spokes, but I guess it could spin around fewer than 16 outer COGs (as COGs have no idea how many siblings they have).
If the FIFO and Rotate do not change, this makes development stable and final COG count can be a late decision.
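
One way to picture the rotate scheme described above is the toy model below. It is a guess at the mechanism, not Chip's design: it assumes hub RAM is split into 16 slices selected by the low nibble of the long address, and that on each clock cog c is connected to slice (c + clock) mod 16. All names are hypothetical.

```python
SPOKES = 16  # fixed by the 4-bit address nibble, whatever the cog count

def slice_of(long_addr):
    """Hub slice holding this long, chosen by the low address nibble."""
    return long_addr & 0xF

def slice_seen_by(cog, clock):
    """ASSUMED rotation: each cog steps to the next slice every clock."""
    return (cog + clock) % SPOKES

def wait_clocks(cog, clock, long_addr):
    """Clocks until the slice holding `long_addr` rotates round to `cog`."""
    return (slice_of(long_addr) - slice_seen_by(cog, clock)) % SPOKES

# With only 12 cogs fitted, every cog still sees a distinct slice each clock.
print(sorted(slice_seen_by(c, 7) for c in range(12)))
print(wait_clocks(cog=5, clock=0, long_addr=0x123))
```

In this model the rotation never refers to the number of cogs, which matches the remark that the final COG count could be a late decision.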

Heater. · 2014-05-19 15:50

koehler,

Rant accepted.

cgracey · 2014-05-19 16:35

Invent-O-Doc wrote: »

Although I have been opposed to hub sharing schemes, I support a larger COG ram space if it doesn't complicate the development of the new P2.

That said, there may be some other factors (die space / power consumption / addressability). Would people trade larger COGs for fewer of them? Perhaps 12?

As far as Cluso99's initial comments, I don't see people running 16 COGs in hubexec mode, maybe 1-3.

It would be trivial to add 4 bits to the 32-bit architecture so that everything becomes 36 bits wide and we'd then have 2K longs of cog RAM.

Right now, I figure the cog logic is going to be about the same size as the cog RAM. If we were to 4x the cog RAM, the cogs would grow in size by 150%. I think it is better to have those MIPS than more cog RAM. What do you think?
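
The "36 bits gives 2K longs" figure can be checked with a line or two of arithmetic. The sketch assumes the four extra bits are split evenly between the D and S fields, which are 9 bits each on the P1; the post does not spell out that split, so treat it as an assumption.

```python
P1_WORD_BITS = 32
P1_FIELD_BITS = 9    # width of the D and S fields on the P1
EXTRA_BITS = 4       # proposed widening of the whole architecture

def addressable_longs(field_bits):
    """Number of cog longs a D or S field of this width can name."""
    return 1 << field_bits

# ASSUMPTION: two of the extra bits go to D and two to S.
new_field_bits = P1_FIELD_BITS + EXTRA_BITS // 2

print(P1_WORD_BITS + EXTRA_BITS)            # word width
print(addressable_longs(P1_FIELD_BITS))     # longs reachable today
print(addressable_longs(new_field_bits))    # longs reachable with wider fields
```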

Heater. · 2014-05-19 16:40

I'll take the MIPs.

potatohead · 2014-05-19 16:46

MIPS here too.

jmg · 2014-05-19 17:17

cgracey wrote: »

It would be trivial to add 4 bits to the 32-bit architecture so that everything becomes 36 bits wide and we'd then have 2K longs of cog RAM.

Right now, I figure the cog logic is going to be about the same size as the cog RAM. If we were to 4x the cog RAM, the cogs would grow in size by 150%. I think it is better to have those MIPS than more cog RAM. What do you think?

What MIPs do you mean here ?

RossH · 2014-05-19 17:38

cgracey wrote: »

It would be trivial to add 4 bits to the 32-bit architecture so that everything becomes 36 bits wide and we'd then have 2K longs of cog RAM.

Right now, I figure the cog logic is going to be about the same size as the cog RAM. If we were to 4x the cog RAM, the cogs would grow in size by 150%. I think it is better to have those MIPS than more cog RAM. What do you think?

MIPS!

cgracey · 2014-05-19 18:00

MIPS, as in millions of instructions per second.

If the cog wound up 2.5x the size, due to this increase in cog RAM, we could only fit less than half the cogs.
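
The area arithmetic behind "150% bigger", "2.5x" and "less than half the cogs" can be sketched as follows. Logic and RAM are each given one unit of area, following the estimate that they are about the same size; the baseline of 16 cogs is an assumption taken from the count discussed elsewhere in the thread.

```python
def cog_area(ram_scale, logic=1.0, ram=1.0):
    """Relative cog area: fixed logic plus RAM scaled by `ram_scale`."""
    return logic + ram * ram_scale

def cogs_that_fit(baseline_cogs, ram_scale):
    """Whole cogs that fit in the die area the baseline cogs occupied."""
    growth = cog_area(ram_scale) / cog_area(1)
    return int(baseline_cogs / growth), growth

cogs, growth = cogs_that_fit(baseline_cogs=16, ram_scale=4)  # 16 is an ASSUMED baseline
print(f"each cog is {growth}x the size ({(growth - 1) * 100:.0f}% bigger)")
print(f"{cogs} cogs fit where 16 did")
```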

jmg · 2014-05-19 18:07

cgracey wrote: »

MIPS, as in millions of instructions per second.

If the cog wound up 2.5x the size, due to this increase in cog RAM, we could only fit less than half the cogs.

Ah, so you mean, more COGS, or fewer COGs.
Is the current plan to have Video LUT separate, or 'borrowed' from COG RAM ?

David Betz · 2014-05-19 18:42

cgracey wrote: »

It would be trivial to add 4 bits to the 32-bit architecture so that everything becomes 36 bits wide and we'd then have 2K longs of cog RAM.

Right now, I figure the cog logic is going to be about the same size as the cog RAM. If we were to 4x the cog RAM, the cogs would grow in size by 150%. I think it is better to have those MIPS than more cog RAM. What do you think?

36 bits? That sounds great. Can you implement the PDP-10 instruction set please? :-)

ozpropdev · 2014-05-19 18:51

Mips rule!

I've never heard a racecar driver complain when you tell them you're giving them more horsepower!

The same seems to apply to computer users.

RossH · 2014-05-19 19:01

David Betz wrote: »

36 bits? That sounds great. Can you implement the PDP-10 instruction set please? :-)

The way things are going, it will be more like the VAX instruction set!

Did anybody ever actually use the POLY instruction to evaluate arbitrary length polynomials?

The table address operand points to a table of polynomial coefficients. The coefficient of the highest-order term of the polynomial is pointed to by the table address operand. The table is specified with lower-order coefficients stored at increasing addresses. The data type of the coefficients is the same as the data type of the argument operand. The evaluation is carried out by Horner's method, and the contents of R0 (R1'R0 for POLYD and POLYG, R3'R2'R1'R0 for POLYH) are replaced by the result.
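
For anyone who has not met it, Horner's method as described in that passage looks roughly like this in Python. It is a sketch of the evaluation order only, with the coefficient table stored highest-order term first; it does not model the VAX operand formats, register usage, or floating-point behaviour, and the function name is made up here.

```python
def poly(x, coeffs):
    """Evaluate a polynomial by Horner's method.

    `coeffs` is the coefficient table, highest-order term first,
    lower-order terms at increasing indices.
    """
    result = 0
    for c in coeffs:
        result = result * x + c   # fold in the next lower-order term
    return result

# x^2 + 2x + 3 at x = 2
print(poly(2, [1, 2, 3]))
```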

Or any of the millions of character processing instructions? For example, here is SPANC:

The assembler successively uses the bytes of the string specified by the length and address operands to index into a 256-byte table whose first entry (entry number 0) address is specified by the table address operand. The logical AND is performed on the byte selected from the table and the mask operand. The operation continues until the result of the AND is zero, or until all the bytes of the string have been exhausted. If a zero AND result is detected, the condition code Z-bit is cleared; otherwise, the Z-bit is set.
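
The behaviour described there boils down to a table-driven scan. The sketch below follows the quoted description only; the function name, the return convention (index plus Z flag) and the whitespace table in the usage lines are illustrative assumptions, not the VAX register interface.

```python
def spanc(data, table, mask):
    """Scan `data` while (table[byte] & mask) is non-zero.

    Returns (index, z): `index` is where the scan stopped, and `z`
    mirrors the Z bit, cleared (False) if a zero AND result was found
    and set (True) if the whole string was consumed.
    """
    for i, byte in enumerate(data):
        if (table[byte] & mask) == 0:
            return i, False
    return len(data), True

# Hypothetical table: bit 0 marks whitespace characters.
table = bytearray(256)
for ch in b" \t":
    table[ch] = 0x01

print(spanc(b"  \tabc", table, 0x01))   # skips the leading whitespace
```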

Now THAT'S an instruction set!
