New Hub Scheme For Next Chip

David Betz · 2014-05-21 17:16

Cluso99 wrote: »

Thanks for the CMM info.
So more cog memory would no doubt aid CMM mode too.

How do you come to that conclusion? More space for FCACHE?

Cluso99 · 2014-05-21 17:39

Bill,
I am proposing the additional cog ram would run precisely the same as hubexec. The only difference is that the instruction/data is on cog and so it does not require the hub slot to run.
So no, there is no need for dual port (besides, that also requires additional S & D bits >9).

So, extended cog ram is better than hubexec and LMM in any implementation. There are absolutely no deficiencies over hubexec or LMM. Period !

Bill Henning wrote: »

Nope. Very logical argument.

JMPRET is used for subroutine calls for cog mode (remember, no LIFO stack, no CLUT stack)

Single port cog memory would require an additional cycle for every JMP/branch to write the destination address, so all branches would be 50% slower. Not worth it, better keep cog memory dual ported.
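
The 50% figure works out if a branch normally takes two clocks and the extra write adds a third. The two-clock baseline is inferred from the quoted percentage and is not stated anywhere in the thread, so treat this as a rough sketch with hypothetical names:

```python
# Assumption: a branch costs 2 clocks with dual-port cog RAM (inferred from
# the "50% slower" figure), and single-port RAM adds 1 clock for the
# return-address write that JMPRET performs.
DUAL_PORT_BRANCH_CLOCKS = 2
EXTRA_WRITE_CLOCKS = 1


def branch_slowdown(base_clocks: int, extra_clocks: int) -> float:
    """Return the fractional slowdown caused by the extra clocks."""
    return extra_clocks / base_clocks


single_port_clocks = DUAL_PORT_BRANCH_CLOCKS + EXTRA_WRITE_CLOCKS
slowdown = branch_slowdown(DUAL_PORT_BRANCH_CLOCKS, EXTRA_WRITE_CLOCKS)
print(f"{single_port_clocks} clocks per branch, {slowdown:.0%} slower")
```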

FJMP/FCALL are horribly expensive in LMM modes (compared to hubexec), but do not need self modifying code for subroutines.

CALL in hubexec does not need self modifying code.

Using INDA/B (or X/Y) for stack with additional instructions slows things down a lot (compared to dual ported cog memory).

With hubexec (as long as there is a FIFO or some caching) we can have huge programs with good performance.

For drivers, 512 cog locations is fine, and besides, with FIFO drivers can stream to/from hub REALLY fast.

Cluso99 · 2014-05-21 17:41

David Betz wrote: »

How do you come to that conclusion? More space for FCACHE?

Because I am fairly sure you could extend the support routines to support your VM model.
If FCACHE is in cog, then yes too.

Cluso99 · 2014-05-21 17:48

BTW, everyone was happy to live with 128KB hub on the old P2 design. But now it seems everyone demands 512KB.

A little of that I think would be more beneficial to have in the cogs. For large programs that require some hub use for program code, they would benefit by being able to place more code/routines in cog.

It would also permit some of the additional cog ram to be used for a stack, rather than go to hub.

Imagine the thrashing going on with GCC. Every time there is a CALL, the return address is placed in a fixed cog register (say $1EF). If the CALL is a ??? type, then the return address then has to be pushed onto the hub stack by software. If the first (say 4KB) of that were in cog, that would no doubt yield a very significant throughput improvement. With the current new hub method, the FIFO has to be bypassed when pushing to the stack and popping from the stack, else you may have stale data in the FIFO.

Bill Henning · 2014-05-21 17:49

I missed that part (re uses hubexec style instructions), maybe CALLA/B, RETA/B for using INDA/INDB as stack pointers.

One deficiency: makes hub smaller

Mind you... if cog ram can be increased WITHOUT decreasing the hub size (maybe bit bigger die)... say 512KB hub, and adding 256 long lut, and another 256 longs to each cog (so 512 regular registers, 512+ cogexec longs)... that would be sweet.

Cluso99 wrote: »

Bill,
I am proposing the additional cog ram would run precisely the same as hubexec. The only difference is that the instruction/data is on cog and so it does not require the hub slot to run.
So no, there is no need for dual port (besides, that also requires additional S & D bits >9).

So, extended cog ram is better than hubexec and LMM in any implementation. There are absolutely no deficiencies over hubexec or LMM. Period !

Cluso99 · 2014-05-21 17:52

Personally, I am rapidly coming to the conclusion that apart from some special uses, the new hub method is way too complex, and uses way too much silicon. The sweet spot cannot be determined either, so code cannot take advantage of it. I love the transfer rate it offers, but that seems to be the only advantage in an otherwise complex solution.

Bill Henning · 2014-05-21 17:52

Due to reduction in pins, and loss of XFER (sdram), having as big as possible hub is very important.

I loved having sdram on the P2, tons of memory for 1080p 24bpp, large capture buffers, tons of xmm code space... sniff.

I thought the latest P2 design was 256KB?

Cluso99 wrote: »

BTW, everyone was happy to live with 128KB hub on the old P2 design. But now it seems everyone demands 512KB.

A little of that I think would be more beneficial to have in the cogs. For large programs that require some hub use for program code, they would benefit by being able to place more code/routines in cog.

It would also permit some of the additional cog ram to be used for a stack, rather than go to hub.

Imagine the thrashing going on with GCC. Every time there is a CALL, the return address is placed in a fixed cog register (say $1EF). If the CALL is a ??? type, then the return address then has to be pushed onto the hub stack by software. If the first (say 4KB) of that were in cog, that would no doubt yield a very significant throughput improvement. With the current new hub method, the FIFO has to be bypassed when pushing to the stack and popping from the stack, else you may have stale data in the FIFO.

Cluso99 · 2014-05-21 17:57

Bill Henning wrote: »

I missed that part (re uses hubexec style instructions), maybe CALLA/B, RETA/B for using INDA/INDB as stack pointers.

One deficiency: makes hub smaller

Mind you... if cog ram can be increased WITHOUT decreasing the hub size (maybe bit bigger die)... say 512KB hub, and adding 256 long lut, and another 256 longs to each cog (so 512 regular registers, 512+ cogexec longs)... that would be sweet.

If you are willing to accept the reduction of 512 registers to 256 registers, then together with the LUT, you can have 4KB (of which 256 is registers too) of cog space, for precisely the same silicon space ! ie No encroachment on hub ram !
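
One reading that reproduces the 4KB figure is sketched below. It assumes a dual-port long costs roughly the silicon of two single-port longs. That ratio is a guess made for illustration and is not stated in the thread; the constant names are hypothetical too.

```python
BYTES_PER_LONG = 4

# Assumption (not stated in the thread): one dual-port long costs about as
# much silicon as two single-port longs.
SINGLE_PORT_LONGS_PER_DUAL_PORT_LONG = 2

registers_kept = 256                        # dual-port registers that remain
registers_given_up = 512 - registers_kept   # dual-port longs traded away
lut_longs = 256                             # the existing LUT, counted as cog space

# Area freed by the dropped registers, re-spent as single-port exec memory.
single_port_longs = registers_given_up * SINGLE_PORT_LONGS_PER_DUAL_PORT_LONG

total_longs = registers_kept + lut_longs + single_port_longs
total_bytes = total_longs * BYTES_PER_LONG
print(total_longs, "longs =", total_bytes // 1024, "KB")
```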

I am still concerned (although Chip has indicated otherwise) that there will not be enough space for 512KB of hub ram. If that happens, then it will fall by necessity in a rather large chunk. Maybe there may be enough space to add another 1KB or 2KB to the cogs... no-one knows currently.

Bill Henning · 2014-05-21 18:10

I'd rather lose some cogs than hub, but I don't think either is needed, as Chip has indicated that the die area can grow (from 7mm2 to 9mm2)

I don't want to lose registers, as we need it for local stack space.

But if we can't get significantly more hub memory (say at least another 128KB) I'd rather add lut/exec area to cogs than a few KB to the hub. For example, instead of getting another 64KB hub, I'd rather add 4KB of your single port exec memory to each cog. If that 4KB could be used as the stack for hubexec, it would greatly improve the performance of medium size C programs.

Cluso99 wrote: »

If you are willing to accept the reduction of 512 registers to 256 registers, then together with the LUT, you can have 4KB (of which 256 is registers too) of cog space, for precisely the same silicon space ! ie No encroachment on hub ram !

I am still concerned (although Chip has indicated otherwise) that there will not be enough space for 512KB of hub ram. If that happens, then it will fall by necessity in a rather large chunk. Maybe there may be enough space to add another 1KB or 2KB to the cogs... no-one knows currently.

David Betz · 2014-05-21 18:10

jazzed wrote: »

Eric's CMM definition:

https://code.google.com/p/propgcc/source/browse/doc/CMM.txt

CMM will still be important with the new chip even with 512KB HUB RAM.

Possibly but the speed difference between hubexec and CMM in P2 will be much greater than the difference between LMM and CMM in P1.

Roy Eltham · 2014-05-21 18:13

Because the memory is synthesized along with the cog logic and whatnot, they can get pretty good estimates of the total size needed for it all. And since the area they need to fit into is a square area inside the pins, it's all very simple now instead of the odd shape the synthesis used to be when they were using their own memories. Also, since Chip knows about the power issues now, they are keeping a good handle on it. I suggest that you can be less worried about power and space issues with this go around.

Cluso,
I understand you have this strong desire to get more memory inside the cogs, but I really dislike the path you have been pushing. It just creates another "zone" for code to live in that is limited, and you are suggesting sacrificing cog register space in the process, which I really really (really!!!) do not like. I think the 512 long register/code space for cogs is a sweet spot for doing hardware drivers and timing sensitive things, and I think having HUB memory be as large as possible is very desirable.

I much prefer methods that leave the cogs being as simple and straightforward as possible, and then do things to make executing code from HUB as efficient as possible without making things too complex for the users. Chip seems to be strongly in that camp as well. I know he'd rather not have to do hubexec at all, because it involves a huge pile of extra instructions as well as complexity into the cogs in order to support it. I think he would love it if there was a simpler/smaller subset of stuff we could do that would make LMM more efficient and go with that instead of hubexec.

jmg · 2014-05-21 18:17

Bill Henning wrote: »

I thought the latest P2 design was 256KB?

Current 'target' is 512K on die, and less in FPGA builds. - but a lot is changing, so 512k is (?)

jmg · 2014-05-21 18:22

Roy Eltham wrote: »

...
I much prefer methods that leave the cogs being as simple and straightforward as possible, and then do things to make executing code from HUB as efficient as possible without making things too complex for the users. Chip seems to be strongly in that camp as well. I know he'd rather not have to do hubexec at all, because it involves a huge pile of extra instructions as well as complexity into the cogs in order to support it

Yes, my take is HubExec is by no means guaranteed, and whilst it is nice to run code from LUT (given that is zero-additional-cost RAM, i.e. already there), if there is no HubExec, that desire becomes rather moot.

Best to wait to see if HubExec is going to be real, and what opcodes it brings, and then see if Code can run from LUT.

David Betz · 2014-05-21 18:25

Roy Eltham wrote: »

Because the memory is synthesized along with the cog logic and whatnot, they can get pretty good estimates of the total size needed for it all. And since the area they need to fit into is a square area inside the pins, it's all very simple now instead of the odd shape the synthesis used to be when they were using their own memories. Also, since Chip knows about the power issues now, they are keeping a good handle on it. I suggest that you can be less worried about power and space issues with this go around.

Cluso,
I understand you have this strong desire to get more memory inside the cogs, but I really dislike the path you have been pushing. It just creates another "zone" for code to live in that is limited, and you are suggesting sacrificing cog register space in the process, which I really really (really!!!) do not like. I think the 512 long register/code space for cogs is a sweet spot for doing hardware drivers and timing sensitive things, and I think having HUB memory be as large as possible is very desirable.

I much prefer methods that leave the cogs being as simple and straightforward as possible, and then do things to make executing code from HUB as efficient as possible without making things too complex for the users. Chip seems to be strongly in that camp as well. I know he'd rather not have to do hubexec at all, because it involves a huge pile of extra instructions as well as complexity into the cogs in order to support it. I think he would love it if there was a simpler/smaller subset of stuff we could do that would make LMM more efficient and go with that instead of hubexec.

What additional opcodes are required for hubexec other than the 17 bit versions of JMP, CALL, and RET and some way to load a 32 bit constant?

Cluso99 · 2014-05-21 18:32

Roy Eltham wrote: »

Because the memory is synthesized along with the cog logic and whatnot, they can get pretty good estimates of the total size needed for it all. And since the area they need to fit into is a square area inside the pins, it's all very simple now instead of the odd shape the synthesis used to be when they were using their own memories. Also, since Chip knows about the power issues now, they are keeping a good handle on it. I suggest that you can be less worried about power and space issues with this go around.

Cluso,
I understand you have this strong desire to get more memory inside the cogs, but I really dislike the path you have been pushing. It just creates another "zone" for code to live in that is limited, and you are suggesting sacrificing cog register space in the process, which I really really (really!!!) do not like. I think the 512 long register/code space for cogs is a sweet spot for doing hardware drivers and timing sensitive things, and I think having HUB memory be as large as possible is very desirable.

I much prefer methods that leave the cogs being as simple and straightforward as possible, and then do things to make executing code from HUB as efficient as possible without making things too complex for the users. Chip seems to be strongly in that camp as well. I know he'd rather not have to do hubexec at all, because it involves a huge pile of extra instructions as well as complexity into the cogs in order to support it. I think he would love it if there was a simpler/smaller subset of stuff we could do that would make LMM more efficient and go with that instead of hubexec.

If Chip allows the LUT to be used as cog program space as I have suggested, together with the minimal jump versions becoming relative, I will prove to you that what you are saying is untrue. It will be as simple as it was to write the hubexec program that I wrote on the old P2 fpga code. There is really no thinking involved. It just works.

I am fine if everyone wants 496 registers. I merely point out, that when I prove what I am saying, we can...
1. Get another 1KB of program (and stack) space by reducing to 256 registers
2. Currently we don't want to lose any space to more configuration registers, so Chip has resorted to a lot of additional instructions. This will reduce that pressure and quite likely result in fewer (and therefore simpler) instructions.
3. Any increase in cog ram will aid more complex drivers without resorting to hub based solutions.

But Roy, I get you cannot see it. I just cannot seem to explain it any simpler than I have.

Bill Henning · 2014-05-21 18:33

Roy Eltham wrote: »

I know he'd rather not have to do hubexec at all, because it involves a huge pile of extra instructions as well as complexity into the cogs in order to support it. I think he would love it if there was a simpler/smaller subset of stuff we could do that would make LMM more efficient and go with that instead of hubexec.

Roy,

1) See my analysis earlier. LMM has a huge built in performance penalty, which can be helped by the fifo, however it is still at least 2x slower than hubexec.

2) What huge pile of instructions? I have:

JMP
CALL
RET
LOCPTRA
LOCPTRB
LOCINS

When the previous P2 had seven variations each for JMP/CALL/RET, I could understand it being called a lot.

For this design iteration, I'd like to see just three variations - LR version, INDA version, PTRA version of CALL/RET. That's it.

I could live with just two (INDA & PTRA) but I realize the gcc guys really want LR.

The PTRA version is useful for large compiled code, and also for accessing local variables on the hub stack.

The INDA version is very useful for large pasm programs.

The LR version is useful for GCC.

INDX/Y no longer exist, so those versions are not needed.

So the list would be:

JMP
CALL (three variations)
RET (two variations, JMP LR is the same as RETLR)
LOCPTRA
LOCPTRB
LOCINS

That is 9 instructions total.
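
The count can be checked with a quick tally. This is only a sketch of the list above; the dictionary name is hypothetical, and the comments on which variations are meant reflect one reading of the post.

```python
# Bill's proposed hubexec additions, with the number of variations he lists.
proposed = {
    "JMP": 1,
    "CALL": 3,      # LR, INDA and PTRA versions
    "RET": 2,       # presumably the two stack versions; JMP LR doubles as RETLR
    "LOCPTRA": 1,
    "LOCPTRB": 1,
    "LOCINS": 1,
}

total = sum(proposed.values())
print(total)
```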

PTRB ... well, one hub stack is enough per cog.

David Betz · 2014-05-21 18:35

Bill Henning wrote: »

Roy,

1) See my analysis earlier. LMM has a huge built in performance penalty, which can be helped by the fifo, however it is still at least 2x slower than hubexec.

2) What huge pile of instructions? I have:

JMP
CALL
RET
LOCPTRA
LOCPTRB
LOCINS

When the previous P2 had seven variations each for JMP/CALL/RET, I could understand it being called a lot.

For this design iteration, I'd like to see just three variations - LR version, INDA version, PTRA version. That's it.

I could live with just two (INDA & PTRA) but I realize the gcc guys really want LR.

The PTRA version is useful for large compiled code, and also for accessing local variables on the hub stack.

The INDA version is very useful for large pasm programs.

The LR version is useful for GCC.

INDX/Y no longer exist, so those versions are not needed.

PTRB ... well, one hub stack is enough per cog.

Why are LOCPTRA, LOCPTRB, and LOCINS needed?

Roy Eltham · 2014-05-21 18:36

There are like 4-5 versions of every call/jmp/ret. With combinations of immediate absolute, immediate relative, and register based, as well as A and B variants using the different indexing registers for the stacks. There would also be all of the push/pop variants. I think there are also some considerations to be made in existing instructions for use with hubexec.

Chip showed me a list of instructions, and he had *'s on all the ones needed for hubexec, and it was like 1/3 to 1/2 of the list.

Cluso99 · 2014-05-21 18:40

David Betz wrote: »

What additional opcodes are required for hubexec other than the 17 bit versions of JMP, CALL, and RET and some way to load a 32 bit constant?

1. DJNZ, etc (conditional jumps) need to be relative +/-127

2. PC needs to increase to 17 bits

3. Address mapping: Cog is $000-$1FF (if you don't increase cog ram); Hub is $200-$7FFFF. Rom $000-$1FF is hidden.

4. JMP/CALL/RET Relative/Absolute 17 bit versions

5. LOAD 32 bit constant. I suggest a 4 clock double long instruction where the immediate value follows the 32bit LOAD instruction. (But I can live without it if necessary)

That's all that is necessary. There may be some niceties, but we don't even have the above in LMM.
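
A small sketch of the address split in item 3, plus a check that a 17-bit PC reaches the whole hub. It assumes the PC counts longs while $7FFFF is a byte address; the thread does not spell out the units, and the function name is hypothetical.

```python
COG_FIRST, COG_LAST = 0x000, 0x1FF     # cog window from the post
HUB_FIRST, HUB_LAST = 0x200, 0x7FFFF   # hub range from the post


def region(addr: int) -> str:
    """Classify an address as cog or hub using the ranges above."""
    if COG_FIRST <= addr <= COG_LAST:
        return "cog"
    if HUB_FIRST <= addr <= HUB_LAST:
        return "hub"
    raise ValueError(f"address {addr:#x} is outside the map")


# Assumption: the 17-bit PC addresses 32-bit longs, so it spans 2**17 longs.
PC_BITS = 17
addressable_bytes = (1 << PC_BITS) * 4

print(region(0x1FF), region(0x200), addressable_bytes // 1024)
```

Under that assumption the 17-bit PC covers exactly the 512KB that ends at $7FFFF.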

Bill Henning · 2014-05-21 18:40

To save memory.

AUGS var
SETPTRA #var

vs

LOCPTRA

Same for LOCPTRB

This will save a lot of memory as whenever a compiler needs to reference a long based variable, or function address for a function pointer, four bytes are saved. Ditto for every array reference.

Chip also had

AUGS var
LOCINS D,#s
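
The saving is one 32-bit instruction per reference. A back-of-envelope sketch follows; the reference count is a made-up figure and the function name is hypothetical.

```python
BYTES_PER_INSTRUCTION = 4  # every instruction is one 32-bit long


def bytes_saved(references: int) -> int:
    """Hub bytes saved by using LOCPTRA instead of AUGS + SETPTRA."""
    two_instruction_form = 2 * BYTES_PER_INSTRUCTION   # AUGS + SETPTRA
    one_instruction_form = 1 * BYTES_PER_INSTRUCTION   # LOCPTRA alone
    return references * (two_instruction_form - one_instruction_form)


# Hypothetical program with 1000 variable or function-pointer references.
print(bytes_saved(1000))
```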

David Betz wrote: »

Why are LOCPTRA, LOCPTRB, and LOCINS needed?

David Betz · 2014-05-21 18:42

Roy Eltham wrote: »

There are like 4-5 versions of every call/jmp/ret. With combinations of immediate absolute, immediate relative, and register based, as well as A and B variants using the different indexing registers for the stacks. There would also be all of the push/pop variants. I think there are also some considerations to be made in existing instructions for use with hubexec.

Chip showed me a list of instructions, and he had *'s on all the ones needed for hubexec, and it was like 1/3 to 1/2 of the list.

I don't believe all of those instructions are needed. The relative branches aren't really *needed* but are nice. The stack versions aren't needed by PropGCC and aren't really needed at all since you can always push the LR register with a separate instruction if necessary.
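
A toy model of the link-register convention, to show why a stack version of CALL is not strictly required. The class and method names are hypothetical and this is not PropGCC's actual code generation, only the general idea: CALL writes LR, and only non-leaf functions spend a separate instruction saving it.

```python
class Cog:
    """Minimal model: a link register plus a software-managed stack."""

    def __init__(self):
        self.lr = None
        self.stack = []
        self.stack_writes = 0

    def call(self, return_address: int) -> None:
        # CALL only writes the link register; no memory access.
        self.lr = return_address

    def push_lr(self) -> None:
        # Separate instruction, needed only before a nested call.
        self.stack.append(self.lr)
        self.stack_writes += 1

    def pop_lr(self) -> None:
        self.lr = self.stack.pop()

    def ret(self) -> int:
        # RET is just a jump to whatever LR holds.
        return self.lr


cog = Cog()
cog.call(10)            # main calls f, return address 10
cog.push_lr()           # f is not a leaf, so it saves LR
cog.call(20)            # f calls leaf g, return address 20
back_in_f = cog.ret()   # g returns without touching the stack
cog.pop_lr()            # f restores its own return address
back_in_main = cog.ret()
print(back_in_f, back_in_main, cog.stack_writes)
```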

David Betz · 2014-05-21 18:43

Cluso99 wrote: »

1. DJNZ, etc (conditional jumps) need to be relative +/-127

2. PC needs to increase to 17 bits

3. Address mapping: Cog is $000-$1FF (if you don't increase cog ram); Hub is $200-$7FFFF. Rom $000-$1FF is hidden.

4. JMP/CALL/RET Relative/Absolute 17 bit versions

5. LOAD 32 bit constant. I suggest a 4 clock double long instruction where the immediate value follows the 32bit LOAD instruction. (But I can live without it if necessary)

That's all that is necessary. There may be some niceties, but we don't even have the above in LMM.

If there is still an AUGS instruction you don't even need to modify DJNZ.

Bill Henning · 2014-05-21 18:44

Since when does a different addressing mode count as a separate instruction??????

abs/rel on addressing allows for position independent code, but it is not strictly necessary - but makes life a LOT easier, and allows loading binary blobs of code

See above, 3 versions of call, 2 of ret, I do not count abs/rel as a different instruction. That's plain silly.

Roy Eltham wrote: »

There are like 4-5 versions of every call/jmp/ret. With combinations of immediate absolute, immediate relative, and register based, as well as A and B variants using the different indexing registers for the stacks. There would also be all of the push/pop variants. I think there are also some considerations to be made in existing instructions for use with hubexec.

Chip showed me a list of instructions, and he had *'s on all the ones needed for hubexec, and it was like 1/3 to 1/2 of the list.

David Betz · 2014-05-21 18:45

Bill Henning wrote: »

To save memory.

AUGS var
SETPTRA #var

vs

LOCPTRA

Same for LOCPTRB

This will save a lot of memory as whenever a compiler needs to reference a long based variable, or function address for a function pointer, four bytes are saved. Ditto for every array reference.

Chip also had

AUGS var
LOCINS D,#s

I didn't say they weren't useful. I just said they aren't necessary. Would you rather have a simplified hubexec or none?

Cluso99 · 2014-05-21 18:46

Roy,
That's because Chip kept saying they were easy, so we all jumped on the bandwagon. And turns out they were not all that easy. But the overall killer was the multi-tasking because that required having 4 ALU sections at various sections of the pipeline, and then combined with things like AUGD/S, LOC, all the JMP/CALL/RET variants including the "D"elayed versions we no longer require. Currently, while nice, I am not even asking for stacks. I can live with the GCC requirement of the return address being placed in a fixed register.
All these things have been worked around with LMM. LMM was invented by Bill to overcome a shortcoming of insufficient cog ram.

Roy Eltham wrote: »

There are like 4-5 versions of every call/jmp/ret. With combinations of immediate absolute, immediate relative, and register based, as well as A and B variants using the different indexing registers for the stacks. There would also be all of the push/pop variants. I think there are also some considerations to be made in existing instructions for use with hubexec.

Chip showed me a list of instructions, and he had *'s on all the ones needed for hubexec, and it was like 1/3 to 1/2 of the list.

Bill Henning · 2014-05-21 18:50

Based on what Chip wrote before, implementing these instructions takes extremely few gates.

Using less hub memory, and all code being faster due to one less memory fetch, makes them a no brainer.

Adding caching (on top of FIFO) is complex and takes many gates.

Any feature that takes a lot of resources (gates) needs to be carefully considered.

A feature that takes almost nothing, but saves precious hub memory, is pretty much a no brainer.

If absolutely necessary, I could see dropping LOCINS, but again, my mind boggles at dropping trivialities.

David Betz wrote: »

I didn't say they weren't useful. I just said they aren't necessary. Would you rather have a simplified hubexec or none?

Roy Eltham · 2014-05-21 18:51

Cluso,
Please stop telling me I don't get it. I DO GET IT. I DISAGREE WITH YOU. PERIOD.

You seem to gloss over that your full thing involves multiple new zones for code to live in. Sure it's easy to code for those new zones, but it's more code separations and memory separations. You have COG proper PASM in registers, then cog code only no self modify or rets, then this extra cog memory for hubexec-like use, and then hub memory (with hubexec presumably, or LMM).

I don't want to have to write code to comply with 4 memory zones (or even 3 if you drop the extra-cog-memory hubexec-like stuff). You are complicating things and you don't even see it.

You can show me examples of "simple" code in your scheme until you turn blue, I don't care. The limitations of single port executable cog codespace suck. Coders will have to worry about it, because it's part of how the Propeller works to be able to do self-modifying stuff including JMPRETs. With your method you can only do that in a subset of the codespace. So now the compiler will just start spewing errors at you because you happen to have perfectly valid PASM code in the wrong memory zone.

Also, the "extra cog memory for hubexec like use" sucks because, what happens to that memory when I am using the cog as a driver and doing only native PASM? I lose all the memory which should be in HUB.

Again, I totally understand what you are pushing for, and I totally do not like it and disagree with you about it.

David Betz · 2014-05-21 18:52

Bill Henning wrote: »

Based on what Chip wrote before, implementing these instructions takes extremely few gates.

Using less hub memory, and all code being faster due to one less memory fetch, makes them a no brainer.

Adding caching (on top of FIFO) is complex and takes many gates.

Any feature that takes a lot of resources (gates) needs to be carefully considered.

A feature that takes almost nothing, but saves precious hub memory, is pretty much a no brainer.

If absolutely necessary, I could see dropping LOCINS, but again, my mind boggles at dropping trivialities.

If they are indeed trivial then I agree. I was just responding to the post complaining that hubexec added a huge number of instructions.

Bill Henning · 2014-05-21 18:53

Am I the only one that finds it ironic that I came up with LMM, and I am working hard to kill it?

I've been chuckling about that one since Chip said "I am trying very hard not to think of executing from the hub".

Cluso99 wrote: »

All these things have been worked around with LMM. LMM was invented by Bill to overcome a shortcoming of insufficient cog ram.

Cluso99 · 2014-05-21 18:53

IMHO, irrespective of anything else...

DJNZ & friends should be relative !

JMPRET should at least have a relative mode (for the S = goto address anyway). There is an argument both ways for the D return address.

This is important for relocatable code. It should have been in P1, but Chip didn't conceive code running anywhere but cog.
