The case for Additional/Extended COG RAM (+2/4/6/8KB)

David Betz · 2014-05-19 19:07

RossH wrote: »

The way things are going, it will be more like the VAX instruction set!

Did anybody ever actually use the POLY instruction to solve arbitrary length polynomial equations?:

Or any of the millions of character processing instructions? For example, here is SPANC:

Now THAT'S an instruction set!

I could never warm up to the VAX instruction set. The Propeller is more like the PDP-8 with its self-modifying subroutine call instruction.

jmg · 2014-05-19 19:16

ozpropdev wrote: »

I've never heard a racecar driver complain when you tell them you're giving them more horsepower!

True, but if you offered that racecar driver the choice of that horsepower spread over 16 GoKarts, or 12 Formula 1000, then he might pause to think a little more.

RossH · 2014-05-19 19:18

David Betz wrote: »

I could never warm up to the VAX instruction set. The Propeller is more like the PDP-8 with its self-modifying subroutine call instruction.

The entire Linux kernel could have been micro-coded into a single VAX instruction!

Ross.

cgracey · 2014-05-19 19:30

jmg wrote: »

Ah, so you mean, more COGS, or fewer COGs.
Is the current plan to have Video LUT separate, or 'borrowed' from COG RAM ?

Each cog will have a single-port 256x32 RAM for LUT use.

jmg · 2014-05-19 19:47

cgracey wrote: »

Each cog will have a single-port 256x32 RAM for LUT use.

Thanks.
Does that single-port mean the COG cannot update the LUT during read, or will you allow COG writes to share with higher-priority Table-Reads, (time shared) for fSys/2 and all slower streaming rates.
fSys/1 has no spare slots, so that has to be write-then-start, but that will be rare, and for the larger NCO generated fSys/N, time-shared writes should be ok ?

cgracey · 2014-05-19 20:02

jmg wrote: »

Thanks.
Does that single-port mean the COG cannot update the LUT during read, or will you allow COG writes to share with higher-priority Table-Reads, (time shared) for fSys/2 and all slower streaming rates.
fSys/1 has no spare slots, so that has to be write-then-start, but that will be rare, and for the larger NCO generated fSys/N, time-shared writes should be ok ?

Good idea! I hadn't thought about that. We could wait for up to 2 clocks, and if we never had a free slot, we'd just steal the third.
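
A minimal Python sketch of the rule described here, assuming the streamer's table reads always win the single LUT port and a pending cog write waits at most two clocks before taking the third. The function name and the list-of-booleans input are hypothetical and exist only for illustration.

```python
def schedule_cog_write(streamer_reads, max_wait=2):
    """Decide when a cog write requested at clock 0 reaches the LUT port.

    streamer_reads[i] is True when the streamer needs the port on clock i.
    Returns (clock, stolen): the clock the write lands on, and whether it
    had to displace a streamer read to get there.
    """
    for clock, busy in enumerate(streamer_reads):
        if not busy:
            return clock, False   # free slot: the streamer is not disturbed
        if clock >= max_wait:
            return clock, True    # waited max_wait clocks already: steal this one
    raise ValueError("pattern too short to resolve the write")


# fSys/1 streaming leaves no free slot, so the third clock is stolen.
print(schedule_cog_write([True, True, True, True]))
# fSys/2 streaming leaves every other slot free, so nothing is stolen.
print(schedule_cog_write([True, False, True, False]))
```

Under these assumptions a write is delayed by at most two clocks, and only a stream running at fSys/1 ever loses a read.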

Cluso99 · 2014-05-19 20:27

Chip,
Seems you have misunderstood what I am trying to convey. I must not be conveying it properly so I will try again from a different POV.

If you are adding a single port 256x32 RAM for LUT use, this can be used as additional cog ram in exactly the way I am suggesting to add more cog ram.
So let us use that as the example...

The 256 long LUT ram can be accessed as cog addresses $200-$2FF
Increase the PC to 10 bits
Change all JUMPS (DJNZ, etc) to Relative +/-127
- I think this should be done in P2 in any case
Change JMPRET (JMP/CALL/RET) to Relative +/-127 for S (source) and D (Destination) remains absolute
- This means that the Return Address can only be saved in a register (ie $000-$1EF)
- The GCC guys are going to want a CALL that saves in a fixed cog register, say $1EF anyway
To access the LUT as data, a new RDLUT/WRLUT (or MOVLUT) would be nice (but not absolutely required)
- Or RDLONG/WRLONG could be used where if the hub address was <$300 it was in LUT
  - This means that we could not access Hub ROM addresses <$300 - would this be a problem ???

Nothing else needs to change.
We still have D & S operands restricted to normally referring to Cog Registers $000-$1FF.
Instructions executing from LUT cannot be self-modifying (not a problem - we will not be able to do this in hubexec either).
We do not need to expand to 36 bits.
And now we have the means to expand PC to 17 bits to support hubexec simply.
This now gives us 3K of total cog ram. It could be increased further in the same fashion.

Have I explained this simply enough???
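
The address map proposed above can be sketched in Python as follows. This is only an illustration: the constant and function names are hypothetical, and the only facts taken from the post are the ranges $000-$1FF (cog registers), $200-$2FF (LUT) and $1EF as the highest register for a saved return address.

```python
# Hypothetical names; only the address ranges come from the post.
COG_REG_LONGS = 0x200      # $000-$1FF: dual-port cog registers
LUT_LONGS = 0x100          # $200-$2FF: single-port LUT mapped above the cog
LAST_RETURN_REG = 0x1EF    # highest register allowed to hold a return address
BYTES_PER_LONG = 4


def region(addr):
    """Name the memory that an execution address falls in."""
    if not 0 <= addr < COG_REG_LONGS + LUT_LONGS:
        raise ValueError(f"address ${addr:03X} is outside $000-$2FF")
    return "cog" if addr < COG_REG_LONGS else "lut"


def can_self_modify(addr):
    """Only the dual-port registers are reachable as D or S operands."""
    return region(addr) == "cog"


def can_hold_return_address(addr):
    """The post restricts saved return addresses to $000-$1EF."""
    return 0 <= addr <= LAST_RETURN_REG


def pc_bits(longs):
    """Number of PC bits needed to address the given number of longs."""
    return (longs - 1).bit_length()


TOTAL_BYTES = (COG_REG_LONGS + LUT_LONGS) * BYTES_PER_LONG

print(region(0x1FF), region(0x200))            # last register, first LUT long
print(TOTAL_BYTES)                             # 768 longs of 4 bytes each
print(pc_bits(COG_REG_LONGS + LUT_LONGS))      # PC width for 768 longs
```

The totals agree with the post: 768 longs is 3072 bytes (the "3K"), and addressing them takes a 10-bit PC.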

cgracey · 2014-05-20 00:51

Cluso99 wrote: »

Chip,
Seems you have misunderstood what I am trying to convey. I must not be conveying it properly so I will try again from a different POV.

If you are adding a single port 256x32 RAM for LUT use, this can be used as additional cog ram in exactly the way I am suggesting to add more cog ram.
So let us use that as the example...

The 256 long LUT ram can be accessed as cog addresses $200-$2FF
Increase the PC to 10 bits
Change all JUMPS (DJNZ, etc) to Relative +/-127
- I think this should be done in P2 in any case
Change JMPRET (JMP/CALL/RET) to Relative +/-127 for S (source) and D (Destination) remains absolute
- This means that the Return Address can only be saved in a register (ie $000-$1EF)
- The GCC guys are going to want a CALL that saves in a fixed cog register, say $1EF anyway
To access the LUT as data, a new RDLUT/WRLUT (or MOVLUT) would be nice (but not absolutely required)
- Or RDLONG/WRLONG could be used where if the hub address was <$300 it was in LUT
  - This means that we could not access Hub ROM addresses <$300 - would this be a problem ???

Nothing else needs to change.
We still have D & S operands restricted to normally referring to Cog Registers $000-$1FF.
Instructions executing from LUT cannot be self-modifying (not a problem - we will not be able to do this in hubexec either).
We do not need to expand to 36 bits.
And now we have the means to expand PC to 17 bits to support hubexec simply.
This now gives us 3K of total cog ram. It could be increased further in the same fashion.

Have I explained this simply enough???

Sorry I didn't see this earlier. I understand what you are saying here. That could increase the program space by 50%. I wish it could get us around needing hub exec, altogether.

jmg · 2014-05-20 01:01

cgracey wrote: »

That could increase the program space by 50%.

Yes, if the memory is already there (for the LUT), and the smarter opcodes are there too, then it makes sense to allow code to run from this memory. It avoids silicon being wasted or underused.

Brian Fairchild · 2014-05-20 01:03

cgracey wrote: »

I wish it could get us around needing hub exec, altogether.

It can; it's just that something else has to give.

cgracey · 2014-05-20 01:19

Brian Fairchild wrote: »

It can; it's just that something else has to give.

What do you mean? This possibility really intrigues me because hub exec throws a lot of complexity into the cog.

Cluso99 · 2014-05-20 01:42

cgracey wrote: »

Cluso99 wrote: »

Chip,
Seems you have misunderstood what I am trying to convey. I must not be conveying it properly so I will try again from a different POV.

If you are adding a single port 256x32 RAM for LUT use, this can be used as additional cog ram in exactly the way I am suggesting to add more cog ram.
So let us use that as the example...

The 256 long LUT ram can be accessed as cog addresses $200-$2FF
Increase the PC to 10 bits
Change all JUMPS (DJNZ, etc) to Relative +/-127
- I think this should be done in P2 in any case
Change JMPRET (JMP/CALL/RET) to Relative +/-127 for S (source) and D (Destination) remains absolute
- This means that the Return Address can only be saved in a register (ie $000-$1EF)
- The GCC guys are going to want a CALL that saves in a fixed cog register, say $1EF anyway
To access the LUT as data, a new RDLUT/WRLUT (or MOVLUT) would be nice (but not absolutely required)
- Or RDLONG/WRLONG could be used where if the hub address was <$300 it was in LUT
  - This means that we could not access Hub ROM addresses <$300 - would this be a problem ???

Nothing else needs to change.
We still have D & S operands restricted to normally referring to Cog Registers $000-$1FF.
Instructions executing from LUT cannot be self-modifying (not a problem - we will not be able to do this in hubexec either).
We do not need to expand to 36 bits.
And now we have the means to expand PC to 17 bits to support hubexec simply.
This now gives us 3K of total cog ram. It could be increased further in the same fashion.

Have I explained this simply enough???

Sorry I didn't see this earlier. I understand what you are saying here. That could increase the program space by 50%. I wish it could get us around needing hub exec, altogether.

I think that the cog needs to be at least 4+KB in total.

What if the registers were reduced to 256, and the higher 256 dual port were changed to 512 single port (same silicon space)?
Adding the 256 single port LUT gives 4KB (1KB of instructions) of which the lower 256 are also registers.

Now, hubexec would not be so critical for hub access because of the larger cog space.
But almost all the hooks are there for hubexec (minimal support version). It only requires...

PC increased to 17 bits
JMP/CALL/RET Relative & Absolute 17 bit addressing (you did this for the old P2)
- You could just extend the JMP/CALL/RET for the above LUT case. (ie the JMP/CALL/RET determines the address range for cog or hub)
PC fetch mechanism needs to use addresses >$3FF (probably above ROM would work) as fetch long from hub.
- Ignore caching if it's too complex. In this case, it would run ~1:17 speed.
- At least we could trial things and see if instruction caching is required, or where you are up to with the 16 FIFO idea.

Nothing else should be required.
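
A rough Python sketch of the fetch decode this implies. Everything named here is hypothetical; it assumes the PC counts longs, that the 256 registers sit at $000-$0FF with the single-port memory (512 longs plus the 256-long LUT) filling $100-$3FF, and that any address above $3FF is fetched from hub.

```python
# Hypothetical names; the sizes are taken from the post above.
REG_LONGS = 256                  # dual-port registers, assumed at $000-$0FF
SINGLE_PORT_LONGS = 512 + 256    # converted single-port RAM plus the LUT
COG_EXEC_LONGS = REG_LONGS + SINGLE_PORT_LONGS
PC_BITS = 17                     # PC width proposed for hubexec
BYTES_PER_LONG = 4


def fetch_source(pc):
    """Say where an instruction at long address pc would be fetched from."""
    if not 0 <= pc < (1 << PC_BITS):
        raise ValueError("pc does not fit in 17 bits")
    if pc < REG_LONGS:
        return "register"        # executable and usable as D/S operands
    if pc < COG_EXEC_LONGS:
        return "single-port"     # executable only, no self-modifying code
    return "hub"                 # addresses above $3FF fetch a long from hub


print(COG_EXEC_LONGS * BYTES_PER_LONG)     # bytes of cog-side memory
print((1 << PC_BITS) * BYTES_PER_LONG)     # bytes a 17-bit long PC can reach
print(fetch_source(0x0FF), fetch_source(0x3FF), fetch_source(0x400))
```

If the PC does count longs, 17 bits reach 524288 bytes, which matches a 512KB hub, and the cog-side total comes to the 4KB mentioned above.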

Brian Fairchild · 2014-05-20 02:00

cgracey wrote: »

What do you mean? This possibility really intrigues me because hub exec throws a lot of complexity into the cog.

My statement is based around answering a fundamental question...

Is the hub there to support the cogs or are the cogs there to make use of the hub?

...and a secondary question of...

Is hubexec just on the table because it felt like it would be easy? After all, it's only pulling a few longs, one after the other, from memory.

In an ideal symmetrical world the questions wouldn't need to be asked. Everything would be possible but the design is constrained by some fundamentals which it is stuck with.

I started to continue this post with the following...

In other words, are you designing a really neat processor core, with lots of cool IO features, which can, out of the box, emulate the peripherals people have on other chips but WOW, you mean I can customise how my USART works? And it'll work at stupidly high speeds and I don't need to worry about interrupts? Sorry, did you just say that there's 16 of these in the package? So if I want a chip that has 32 USARTS I can have one? With room to spare? And there's a mechanism I can use to move data around, at high speed, between those cores?

Or, are you designing a chip with 512k of RAM whose contents can be written to and read by 16 cores which surround it and which can

...and then ran out of steam. Which I guess sums up my views.

At the heart of this discussion, and all the others over the past couple of weeks, is a question...

Is the P16X64A an application processor with soft peripherals or a soft-peripheral processor which can run applications?

And I guess whilst we all have our own ideas, it's very much down to Chip to decide which approach grabs his interest.

Brian Fairchild · 2014-05-20 02:09

And note that I said...

Brian Fairchild wrote: »

And there's a mechanism I can use to move data around, at high speed, between those cores?

...as I'm not even convinced it needs to be RAM. It could even be a high-speed message passing mechanism.

RossH · 2014-05-20 02:14

cgracey wrote: »

What do you mean? This possibility really intrigues me because hub exec throws a lot of complexity into the cog.

But this doesn't replace HubExec. It extends the size of the cog (by a modest amount), requires a new paradigm (which complicates the programming), and still doesn't replace HubExec (or LMM).

Ross.

Cluso99 · 2014-05-20 02:24

RossH wrote: »

But this doesn't replace HubExec. It extends the size of the cog (by a modest amount), requires a new paradigm (which complicates the programming), and still doesn't replace HubExec (or LMM).

Ross.

Ross,
How many times do I need to say, the new paradigm is in your head. It doesn't complicate things. It is way easier than writing LMM code which requires paired instructions for every jump and call.
Please take a look at the hubexec LCD code I posted back in March. It is way way simpler than LMM code!!! And there are so many other benefits too.

Roy Eltham · 2014-05-20 02:30

Chip,
I think we could get by without hubexec. It does mean slower performance for the C compilers' results and anything else that would have used hubexec instead of LMM. The performance difference would depend greatly on how much of hubexec you were going to do, with cache lines (and prefetch) being the primary perf difference. If you were not intending to do cache lines for hubexec, then the perf difference is not enough to justify the complexity.

I would certainly not put any hubexec stuff in the FPGA image you are working on now, and only consider it after everything else is good and we've had a chance to work with the FPGA image a while.

Don't get me wrong, I would really like to have hubexec with cache if it's reasonable, but I certainly don't want it to muck up the works.

Roy Eltham · 2014-05-20 02:38

Cluso,
In order for LUT memory to become executable, it would need to be dual ported (maybe not, but it'll have severe restrictions described below), and Chip has said that it was only going to be single ported.
That means it would double in size right? Would that mean going to less than 16 cogs? or less hub memory? I don't like either of those choices....

I'm also not sure how much I like the limitation that returns can't target the "extra" executable memory, nor could that memory be self-modified or manipulated in any of the normal ways that the "regular" cog memory can. It means there'll be some hefty restrictions on the code that could go in that "extra" code space.

RossH · 2014-05-20 02:58

Cluso99 wrote: »

Ross,
How many times do I need to say, the new paradigm is in your head. It doesn't complicate things. It is way easier than writing LMM code which requires paired instructions for every jump and call.
Please take a look at the hubexec LCD code I posted back in March. It is way way simpler than LMM code!!! And there are so many other benefits too.

And yet every time you describe it, you have to list the limitations you have to impose on your program to run in each separate memory space.

Ross.

Cluso99 · 2014-05-20 03:25

Roy Eltham wrote: »

Cluso,
In order for LUT memory to become executable, it would need to be dual ported (maybe not, but it'll have severe restrictions described below), and Chip has said that it was only going to be single ported.
That means it would double in size right? Would that mean going to less than 16 cogs? or less hub memory? I don't like either of those choices....

I'm also not sure how much I like the limitation that returns can't target the "extra" executable memory, nor could that memory be self-modified or manipulated in any of the normal ways that the "regular" cog memory can. It means there'll be some hefty restrictions on the code that could go in that "extra" code space.

Roy,
With respect, you have not understood what I am trying to do. Obviously I am not explaining it as well as I should.

I have suggested reducing the register set to 256 so that the other 256 can become 512 single port utilising the same silicon space.
Adding in the LUT (single port) adds another 256. Now we have a total cog ram of 4KB of which only the bottom 256 are registers. This uses exactly the same amount of silicon as the old cog registers + new LUT. So we have doubled the instruction space without using any more silicon area.
The jumps become relative ie +/-127 (like DJNZ, etc)
For JMPRET, the S (source), which is where execution moves to, becomes Relative, and the D (destination), where the return address is saved, must be in a register (below $200, or below $100 if the registers were reduced to 256). GCC requires a fixed register, so this is easily done.
If Chip expanded the JMPRET instruction to be like the JMP/CALL/RET on the old P2, with Relative and Absolute addressing and 23 bits of immediate address/es then you can jump/call/ret anywhere in cog ram. This already works for hubexec on the old P2 because I have used it.
Self-modifying code would only be available within the register space (because it requires dual port ram). GCC does not use self-modifying code.

So what I am proposing is quite simple.

Roy Eltham · 2014-05-20 03:38

Cluso,
Yeah, I get it. My post wasn't worded very well; I am very tired after driving 13 hours to get home. Also, I kind of got it more as I was typing along, so I should have edited my post more. Anyway...

I think restricting self-modifying code and return addresses to a limited subset of the code space is pretty drastic. Pretty much all of my PASM code heavily uses self modifying code. I don't remember if we have INDx stuff in this version of the chip or not. If we do have it then the limit on self modifying code is less painful, but if we don't have INDx, then self-modifying code is critical and needed all over the code, not just in a limited subsection.

I think 2kb of fully usable cog memory will be more powerful than 4kb of cog memory with restricted/limited use for 3/4ths of it.

Cluso99 · 2014-05-21 22:08

FWIW the relative addressing is +/-256.

I forgot we have 9 bits, not 8.
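
For reference, a small Python sketch of what a signed offset field of a given width can reach, assuming plain two's-complement encoding (the thread does not say how the offset would actually be encoded). The function names are hypothetical.

```python
def relative_range(bits):
    """Smallest and largest offset a two's-complement field can hold."""
    half = 1 << (bits - 1)
    return -half, half - 1


def sign_extend(field, bits):
    """Interpret the low `bits` bits of field as a signed offset."""
    field &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (field ^ sign) - sign


print(relative_range(8))          # reach of an 8-bit offset
print(relative_range(9))          # reach of a 9-bit offset
print(sign_extend(0x1FF, 9))      # all ones in a 9-bit field
```

Under that assumption an 8-bit field spans -128 to +127 and a 9-bit field spans -256 to +255, so the "+/-256" above is exact on the negative side and one short on the positive side.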

pik33 · 2014-05-21 23:33

Stop complicating things again. All we really need is fast loading from hub to cog. The hub access slot is every 16 cycles. If the cog is capable of loading 8 longs at once in one hub access slot, then we can make an LMM which will be fast enough. So throw away all of this hubexec, LUT exec, etc. Make a simple Propeller with 512K hub, 250 MHz clock, 16 cogs and 64 IOs. Let these cogs be simply P1 style, only adding MUL and DIV instructions. Ship the chip. The community will do the rest of the miracles then.
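
A back-of-envelope Python sketch of the fetch bandwidth this implies. It uses only the figures in the post (250 MHz clock, one hub slot every 16 cycles, 8 longs per slot) and gives an upper bound on longs delivered to a cog, not an LMM instruction rate, since it ignores the LMM loop's own overhead and execution time. The function name is hypothetical.

```python
def peak_fetch_rate(clock_hz, longs_per_slot, clocks_per_slot):
    """Upper bound on longs per second one cog can pull from hub."""
    return clock_hz * longs_per_slot / clocks_per_slot


CLOCK_HZ = 250_000_000     # 250 MHz, as in the post
CLOCKS_PER_SLOT = 16       # one hub access slot every 16 cycles

wide = peak_fetch_rate(CLOCK_HZ, 8, CLOCKS_PER_SLOT)     # 8 longs per slot
narrow = peak_fetch_rate(CLOCK_HZ, 1, CLOCKS_PER_SLOT)   # 1 long per slot

print(f"{wide / 1e6:.3f} M longs/s with 8 longs per slot")
print(f"{narrow / 1e6:.3f} M longs/s with 1 long per slot")
```

Eight longs per slot works out to half a long per clock, eight times what a single-long hub read per slot would deliver.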

koehler · 2014-05-22 00:33

pik33 wrote: »

Stop complicating things again. All we really need is fast loading from hub to cog. The hub access slot is every 16 cycles. If the cog is capable of loading 8 longs at once in one hub access slot, then we can make an LMM which will be fast enough. So throw away all of this hubexec, LUT exec, etc. Make a simple Propeller with 512K hub, 250 MHz clock, 16 cogs and 64 IOs. Let these cogs be simply P1 style, only adding MUL and DIV instructions. Ship the chip. The community will do the rest of the miracles then.

While I agree with your general points, I think you're in the wrong thread.
Out of all of the sausage making going on, Cluso's idea is the simplest add-on compared to any of the other mind-bending proposals.
It uses what Chip is already intent upon doing (re-architecture), and adds little to no silicon, few if any s/w changes, and doubles the Cog/Core RAM-equivalent space, -without- forcing one to use LMM.

Ohh, and if it is actually anywhere close to 100 MIPS native, then it kicks LMM's butt straight to the curb.

At this point, Chip should have an FPGA image running internally in several days, so we might see a basic image early next week.
It's doubtful there is any going back; I'm not even sure if Cluso's idea will get a fair appraisal.

Roy Eltham · 2014-05-22 00:39

koehler,
except in code size. LMM/hubexec = 512k codespace.

koehler · 2014-05-22 13:36

Roy Eltham wrote: »

koehler,
except in code size. LMM/hubexec = 512k codespace.

Roy, however, with Cluso's idea it would seem as though one could have several Cores with >2K of Core/Cog RAM equivalent, and still run LMM on other Cores using 384k+, or did I miss something?

Several weeks ago many were saying to go ahead with the P1x, as massive program space is rarely needed or an applicable use case.
Now that there is peak b/w available, the P2 is 'obviously' going to be used by 'most' to craft giga-KB+ programs....

Aside from some forumistas' personal desires, Cluso's suggestion seems to directly deal with Ken's stated Customer Request for more RAM, at least within the Core, with Hub already being 10-12x greater than the P1.
I get the feeling some people here seem to want all that Hub RAM for video buffers, at the expense of all the use cases which can/could simply use a moderately larger Core RAM space.

If it is as simple as believed, this seems like the BEST feature yet to actually get the P2 some real notice outside its current limited customer base. Turning the P2 into what is basically a PSOC-type of thing seems to be what some desire, yet it is not in any way supported by the market research done.

Oddly enough, since I am not a fanboy on either side of that fence, my POV is that faster, bigger Cores that can work on data locally vs hub seem like they would be a home run, if it can be done with minimal h/w and s/w changes.

It seems funny.
Even after all of this work, it sounds like people are going to need multiple Cores running LMM to get the aggregate compute level they may need, which one of Cluso's Cores with RAM+ could meet at 2, 4, or 6x LMM speed.

At least that's what it looks like from this newbie's perspective at this time.

Roy Eltham · 2014-05-22 13:48

While I may use some of the 512K ram for video buffers in some projects, my primary use for the 512K will be for code/data for large C++ programs that use a proper runtime instead of a trimmed down tiny one (like most use on the P1).

Cluso99 · 2014-05-22 14:11

Roy Eltham wrote: »

While I may use some of the 512K ram for video buffers in some projects, my primary use for the 512K will be for code/data for large C++ programs that use a proper runtime instead of a trimmed down tiny one (like most use on the P1).

Everyone was happy with 128KB (well, accepted 128KB) hub with the old P2.

Roy,
1. What changed from accepting 128KB Hub?
2. Can you please describe your applications? Including...
3. How many cogs and what size code does each need? PASM or GCC?

4. Do you believe all cogs must be equal in terms of cog ram?
I think some cogs need more memory so they can do the grunt work.

BTW I no longer buy the argument about flat memory space requiring compiler complexity. People are accepting that the compiler will generate interlaced/scrambled hub memory and that is way more work than any flat space memory model that has a couple of "memory" holes.

kuba · 2014-05-22 14:15

Cluso99 wrote: »

I think that the cog needs to be at least 4+KB in total.

What if the registers were reduced to 256, and the higher 256 dual port were changed to 512 single port (same silicon space)?
Adding the 256 single port LUT gives 4KB (1KB of instructions) of which the lower 256 are also registers.

[...]
PC increased to 17 bits
JMP/CALL/RET Relative & Absolute 17 bit addressing (you did this for the old P2)
- You could just extend the JMP/CALL/RET for the above LUT case. (ie the JMP/CALL/RET determines the address range for cog or hub)
PC fetch mechanism needs to use addresses >$3FF (probably above ROM would work) as fetch long from hub.
- Ignore caching if it's too complex. In this case, it would run ~1:17 speed.
- At least we could trial things and see if instruction caching is required, or where you are up to with the 16 FIFO idea.

[...]

I like this idea. Quite a lot, in fact. It seems to give the LUT RAM extension a much more seamless character. Coupled with hubexec, those seem to almost demand each other.

I know, it's a very non-technical term, but sometimes the artistic beauty in the design is nice, too.

Heater. · 2014-05-22 14:21

Did we really get as low as 128K in the old P2? That is really not good enough.

Flat memory space does not require compiler complexity.

However, what has been described is far from a flat memory space.

It has many regions, HUB, COG registers, COG not quite registers and whatever else.

Presumably a compiler, or the programmer, has to be able to deal with all that.
