
A solution to increasing cog ram - works with any hub slot method

Cluso99 Posts: 18,069
edited 2014-05-18 03:27 in Propeller 2
Inspired by Chip & Roy's new hub slot idea, and thinking about hubexec, I came up with the following idea...

Cog ram could in fact be any size.
The lower 2KB ($000-$1FF) works as now.
Addresses above 2KB work just like hubexec, except the ram is private to the cog and can be accessed at full cog speed.
Ram above 2KB would typically be used for instructions, but with a new MOVX instruction (or a variant of MOV) it could also be used for data.
Presuming the lower 2KB remains 2-port, the ram above 2KB could remain single-port.
The program counter would be extended to "n" bits to cater for the additional cog ram.

JMP/CALL/RET would require a change to permit relative and/or direct jumps or calls.

It may still be useful to have a reasonably deep internal LIFO (say 16+ levels) and JMP/CALL/RET/PUSH/POP support instructions.

Execution would run at full cog speed.

Note: This does not preclude the use of hubexec from hub, nor any of the various hub slot methods, including today's method posted by Chip.
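
To make the layout concrete, here is a small C model of the address split described above. The extended ram size used (8KB, i.e. 2048 longs) and the probe addresses are illustrative assumptions only, nothing decided:

```c
#include <stdio.h>

/* Long (32-bit word) addresses, as seen by the cog's program counter.
   The extended-RAM size below is an assumption for illustration only. */
#define COG_REG_TOP  0x200u            /* $000-$1FF: 2-port register ram, reachable as D/S        */
#define COG_EXT_TOP  (0x200u + 2048u)  /* assumed 8KB (2048 longs) of private 1-port extended ram */

static const char *classify(unsigned long_addr)
{
    if (long_addr < COG_REG_TOP) return "cog registers (2-port, full speed, D/S reachable)";
    if (long_addr < COG_EXT_TOP) return "extended cog ram (1-port, full speed, private)";
    return "hub ram (shared, goes through the hub slot mechanism)";
}

int main(void)
{
    unsigned probes[] = { 0x1F0, 0x200, 0x9FF, 0xA00 };
    for (unsigned i = 0; i < sizeof probes / sizeof probes[0]; i++)
        printf("$%03X -> %s\n", probes[i], classify(probes[i]));
    return 0;
}
```

The point is simply that everything below the hub boundary is serviced locally at full cog speed; only the last case ever involves a hub slot.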

Comments

  • Mike Green Posts: 23,101
    edited 2014-05-14 09:01
    I'm concerned about the limited access to anything above 2KB and the cog's reliance on instruction modification. What do you see that additional memory used for? What type of programs do you see making use of it? How does that compare to using RDBLOC for cache-like execution of small blocks of code?

    If you see the additional memory used mostly for data, would a MOVX instruction be enough? Would this perhaps be mostly for buffers?
  • Cluso99 Posts: 18,069
    edited 2014-05-14 10:17
    Mike Green wrote: »
    I'm concerned about the limited access to anything above 2KB and the cog's reliance on instruction modification. What do you see that additional memory used for? What type of programs do you see making use of it? How does that compare to using RDBLOC for cache-like execution of small blocks of code?

    If you see the additional memory used mostly for data, would a MOVX instruction be enough? Would this perhaps be mostly for buffers?
    Mike,
    I see the main use for the additional cog memory being code execution. It would be used like hubexec, but at full cog speed.
    Code running from hub in hubexec cannot run at full speed due to jumps and calls (even if Chip implements the new hub and we interleave code).

    Of course it could also be used for video buffering too.

    RDBLOCK would permit fast loading, so overlays could also be used to advantage.

    LMM mode did away with self-modifying code so I don't believe this is an issue.

    I have been thinking about the ramifications if all 16 cogs had an additional 32KB cog private hub, and there was a 32KB shared hub with the new hub access method. Large block transfers between cogs could be synchronised and both could run at full clock speed using an RDBLOCK variant.

    In other words, I am leaning towards each cog being a separate CPU with 32KB, plus a shared 32KB parallel fast hub. Each cog should be able to approach 100 MIPS sustained at 200MHz.
  • Mike Green Posts: 23,101
    edited 2014-05-14 10:42
    LMM did not do away with self-modifying code. With hub accesses this isn't an issue because of the indirection involved in hub addressing, but cog accesses are direct and require instruction modification. If you have any instructions above $200, there's no way to modify them when indexing is needed, other than keeping copies in the first $200 and copying them into place above $1FF as needed. That puts significant limits on how you can use memory above $1FF for code.
  • Dave Hein Posts: 6,347
    edited 2014-05-14 10:47
    Cluso, I like your idea. Mike, self-modifying code can be relegated to the first 512 longs of memory. I don't see that as a major limitation. That's how it works with hubexec.
  • Cluso99 Posts: 18,069
    edited 2014-05-14 11:11
    Mike,
    LMM doesn't use self-modifying code for any code within hub. The LMM execution unit resides in the lower cog ram and does use self-modifying code, but this hasn't changed in my solution.

    Thanks Dave.

    Mike & Dave,
    What are your thoughts about big cog ram and a small hub used just for cog-to-cog transfers? e.g. cogs of 32KB and a hub of 32KB.
  • jmg Posts: 15,171
    edited 2014-05-14 12:30
    Cluso99 wrote: »
    In other words, I am leaning towards each cog being a separate CPU with 32KB, plus a shared 32KB parallel fast hub. Each cog should be able to approach 100 MIPS sustained at 200MHz.

    The downside of this is that any data storage > 32K, or any single program > 32K, is off the table, and surely that is a serious crippling of the chip?
  • jazzed Posts: 11,803
    edited 2014-05-14 12:39
    jmg wrote: »
    The downside of this is that any data storage > 32K, or any single program > 32K, is off the table, and surely that is a serious crippling of the chip?

    That was my initial reaction also.

    However, if it doesn't cost anything in hardware, then why not? Many "cognew threads" of execution will not be too big for 32KB.
  • mark Posts: 252
    edited 2014-05-14 12:49
    Why would the block sizes have to be fixed? What if there were numerous 4K blocks that could be chained together (to create a contiguous address range) and assigned to cogs at run time? Being able to assign blocks at run time would also allow cog-to-cog data transfers.

    That said, I have no idea what the block control logic would look like.
  • Cluso99 Posts: 18,069
    edited 2014-05-14 13:13
    My first thoughts were that I would want a large contiguous hub space. But thinking more, I thought fast large cogs might have more practical use. We can transfer cog to cog via hub overlapping the transfer, and can do this at 4 bytes / clock = 800MB/s.
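
    Spelling out the arithmetic behind those numbers (a quick C sketch; the 200MHz clock, 2-clock instructions and 4 bytes per clock are the figures already quoted in this thread):

    ```c
    #include <stdio.h>

    int main(void)
    {
        double clock_hz = 200e6;                  /* 200MHz system clock           */
        double mips     = clock_hz / 2.0 / 1e6;   /* 2 clocks per instruction      */
        double mb_per_s = clock_hz * 4.0 / 1e6;   /* 4 bytes transferred per clock */
        printf("%.0f MIPS sustained, %.0f MB/s block transfer\n", mips, mb_per_s);
        return 0;
    }
    ```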

    It's possible there may be enough silicon for 64KB or 128KB hub with cogs of 32KB+2KB each.
    Or 256KB hub and 16 cogs of 16+2KB each.

    Would either of these sound feasible?
  • Cluso99 Posts: 18,069
    edited 2014-05-14 13:15
    mark wrote: »
    Why would the block sizes have to be fixed? What if there were numerous 4K blocks that could be chained together (to create a contiguous address range) and assigned to cogs at run time?

    It's too costly silicon-wise because of the busses involved.
  • mark Posts: 252
    edited 2014-05-14 13:26
    Cluso99 wrote: »
    It's too costly silicon-wise because of the busses involved.

    You're most likely right, but Chip managed to reduce the complexity of his current scheme over what would typically be expected, so who knows? You might be onto something with:

    "It's possible there may be enough silicon for 64KB or 128KB hub with cogs of 32KB+2KB each.
    Or 256KB hub and 16 cogs of 16+2KB each. "

    or some variation of it.
  • jmg Posts: 15,171
    edited 2014-05-14 14:17
    Cluso99 wrote: »
    It's possible there may be enough silicon for 64KB or 128KB hub with cogs of 32KB+2KB each.
    Or 256KB hub and 16 cogs of 16+2KB each.

    Would either of these sound feasible?

    I cannot see that limiting the size of data makes any sense, given most of those private COG memories will go unused.
    I can see the need to increase COG RAM, but not at the expense of shared HUB areas.

    Doubling COG RAM could make some sense, but there are ways to use the new hub rotate to give more granular access to buffers/FIFOs, which is one area where the push for more COG RAM comes from.

    i.e. if you can easily move buffers/FIFOs into the HUB and stay under 16-clock granularity, then that frees COG RAM for code use rather than buffers.
  • mark Posts: 252
    edited 2014-05-14 14:57
    Cluso99 wrote: »
    It's too costly silicon-wise because of the busses involved.

    Depending on how complex (read: how much die space it takes up) the logic is for the current scheme, maybe it's not entirely unreasonable?

    [attached diagram: hubbahubba.jpg]


    I have no idea how the hub ram is physically organized on the die, but I'd imagine it's not unlike the current Prop, just broken up vertically. That said, consider that the data and address lines propagate through entire ram blocks, so a block can also be considered a bus. If the hub logic can be replicated after each block, you might be able to cut down on the number of buses that would typically be required.
  • Kerry S Posts: 163
    edited 2014-05-17 10:01
    How about we just put in real D and S registers? Then the whole memory limitation goes away. Trying to stuff an entire instruction into 32 bits just so we can have self-modifying code is creating a huge bottleneck here. With more cog RAM you should not need to do that, and I bet, as smart as Chip is, he could figure out a way to still do it with some selection scheme for "use register" vs "use inline" in the opcodes.
  • Cluso99 Posts: 18,069
    edited 2014-05-17 10:42
    There really aren't any major issues that cannot be solved simply.

    Now that hubexec has been solved once, it is just a matter of cutting that down to a simpler solution.

    In reality, extended cog ram can be treated the same as hub memory without going via the hub slot mechanism.
    It is also the way traditional RISC micros work...
    * they have registers (a cog/core has 512)
    * they have program memory (often flash)
    * they have data memory (often ram)
    * sometimes program and data memory are the same (ram)

    The Prop's program memory is non-traditional (it runs in the registers).
    Hubexec uses hub ram to extend the program memory, so it can be thought of as equivalent program memory.

    So it is really a matter of thinking of a cog as a traditional RISC micro that:
    * has 512 registers (it can also use these as program memory)
    * has extended ram as program memory and data memory
    * has hub ram as additional program and data memory (ignore that it's shared, and the hub aspect)

    Now, all of a sudden, the cog/core becomes simpler to understand, and the 9-bit S and D fields only refer to the 512 registers.
    Now there is no need to expand from 9 bits. The only requirements are relative/direct jump/call/return instructions, and possibly a new move (load/store) instruction, although rdxxxx/wrxxxx instructions could be used.

    The memory model would use...
    * $000-1FF as register/program/data
    * $200-xxxx as cog/core program/data
    * $xxxx-yyyy as hub program/data

    It really is this simple!!!
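
    As a sanity check on that point about operand fields, here is a trivial C model of the encoding consequence: with D and S left at 9 bits, only the 512 registers are directly addressable as operands, and anything above $1FF has to go through a load/store (the MOVX idea, or rdxxxx/wrxxxx) or a jump/call with a wider address field. The instruction names are still hypothetical:

    ```c
    #include <stdbool.h>
    #include <stdio.h>

    /* 9-bit D/S fields can only name the 512 registers ($000-$1FF). */
    static bool fits_in_ds_field(unsigned long_addr)
    {
        return long_addr < 0x200;
    }

    int main(void)
    {
        printf("$1FF as a direct operand: %s\n", fits_in_ds_field(0x1FF) ? "ok" : "needs a load/store");
        printf("$200 as a direct operand: %s\n", fits_in_ds_field(0x200) ? "ok" : "needs a load/store");
        return 0;
    }
    ```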
  • Kerry S Posts: 163
    edited 2014-05-17 11:08
    Cluso99 wrote: »
    There really aren't any major issues that cannot be solved simply.

    <great explanation here>

    It really is this simple!!!

    Thank you for that. Very well put!

    The key is to get away from the idea that all non-cog memory is hub memory. We do need your third type: extended cog ram that is private to that cog. It solves the issues and is simple.

    I like simple, does not make my brain hurt so much trying to grok it all...
  • evanh Posts: 15,848
    edited 2014-05-17 15:58
    Umm, that's exactly how Hubexec was intended all along. However, the shared nature of HubRAM access provided many challenges to streamlining the implementation.

    Presumably Chip is going to reimplement Hubexec with the latest crosspoint switching system and, given its potential for lower average latency, will also drop most of the caching mechanisms he was having headaches with.
  • koehler Posts: 598
    edited 2014-05-17 16:43
    So, we would in effect have 16 cores with 2K core RAM each, and nK of core-like (speed/determinism is equal?) RAM up to 512K, sort of?

    Cluso's idea seemed to eliminate the random-address timing issue; does this?

    Given Chip's 20-stage FIFO, since it also affects HubExec, does that reduce the timing issue?
  • jmg Posts: 15,171
    edited 2014-05-17 17:03
    koehler wrote: »
    Cluso's idea seemed to eliminate the random address timing issue, does this?

    I don't think local RAM eliminates anything; it just means you can do a little more locally, so you need the HUB less.
    However, for data that is too large for the local RAM (or is shared), the issue remains.
  • RossH Posts: 5,455
    edited 2014-05-17 18:24
    So now we would have three types of RAM, each with different instructions required to access them? Or possibly four if you count LUT (not sure if this is still in the current design or not)?

    Hmmmm.

    Ross.
  • evanh Posts: 15,848
    edited 2014-05-17 20:34
    From what I can make out, it's no change in where RAM is located. All Cluso is saying is that Hub space is instruction-fetchable and even mapped to Cog space, which is exactly what Hubexec provides.
  • Heater. Posts: 21,230
    edited 2014-05-18 00:07
    RossH,
    So now we would have three types of RAM, each with different instructions required to access them? Or possibly four if you count LUT (not sure if this is still in the current design or not)?

    That's not a problem, the compilers can sort all that out, or rather the compiler writers:)

    That's how they handled such architectural complexity in the Intel i860 and Itanium. See what great successes they were!
  • RossH Posts: 5,455
    edited 2014-05-18 01:50
    Heater. wrote: »
    That's not a problem, the compilers can sort all that out, or rather the compiler writers:)

    Sorry, not this little black duck!
  • Heater. Posts: 21,230
    edited 2014-05-18 02:05
    None of the GCC guys either, I would imagine.
  • Baggers Posts: 3,019
    edited 2014-05-18 02:17
    Don't forget that going to 256K HUB RAM with 16 x 16K COG RAM is going to lose a large chunk of RAM; a lot of the drivers in cogs won't use 16K of RAM. I mean, some of the ones we have now don't even use 1K, let alone 2K.

    I know you're still thinking of a way to get some form of HUB-EXEC, but I don't think this is a good use of the available RAM, with it not being accessible to all cogs.
  • Cluso99 Posts: 18,069
    edited 2014-05-18 02:28
    I am back home on my PC. That Xoom is terrible for editing on this forum. You cannot place the cursor where you want, no matter how hard you try. I believe it's an Android problem. BTW my iPad Mini Retina works fine, but I didn't have it long before it got stolen (it's now in Korea with my wife ;) ).

    I intend to post a few diagrams as to how this would work. But basically, it would just be a contiguous address space (potentially with some blocks unavailable, reserved for later expansion). This would be just as Chip has done with hubexec.
    $000-1FF : cog ram (registers/program/data)
    $200-xxx : extended cog ram (program/data)
    $xxx+1 - yyyy-1 : reserved for additional extended cog ram (future)
    $yyyy-zzz : shared (hub) ram (program/data)

    Now, I would expect the hub ROM at $000 to yyyy-1 would not normally be accessible after boot. This way, no remapping of hub addresses would be necessary.

    The extended cog ram would run at the same speed as cog register ram, i.e. full cog speed: 200MHz, 2-clock instructions. There would be no hub slot delay, because there is no hub slot mechanism to access this ram. It's absolutely deterministic.
    The only difference is that the extended cog ram would be single-port (because it's not registers), so it doesn't need to be available for both D & S.

    If the extended cog ram were 2/4/6/8KB, just maybe it wouldn't reduce the 512KB of hub space. Of course, it depends on silicon area for everything else.
    Now we could have larger cog programs... an additional 2KB would provide more than 2x the previous program space (because the 2KB cog has some fixed registers, and a program always needs some work-area registers). 4KB would give more than 3x, etc.
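
    A rough check on those ratios (the 80 longs assumed here for fixed registers plus work area is only a guess; the real overhead depends on the driver):

    ```c
    #include <stdio.h>

    int main(void)
    {
        unsigned cog_longs  = 512;                  /* today's cog ram                    */
        unsigned overhead   = 80;                   /* assumed special + work-area longs  */
        unsigned usable_now = cog_longs - overhead; /* ~432 longs left for code           */

        for (unsigned extra_kb = 2; extra_kb <= 8; extra_kb += 2) {
            unsigned usable_ext = usable_now + extra_kb * 1024 / 4;
            printf("+%uKB extended ram: %u usable longs (%.1fx today)\n",
                   extra_kb, usable_ext, (double)usable_ext / usable_now);
        }
        return 0;
    }
    ```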

    I do wonder if this extended cog ram, or the existing cog ram, could be used for the CLUT with restrictions.
    Maybe if the CLUT used existing cog ram $000-0FF, then the program would need to be run from extended cog ram or hub ram (because the CLUT ram needs a read port to forward data to the DAC).
    Maybe if the CLUT used the extended cog ram, then the program would need to be run from existing cog ram or hub ram.
    I cannot answer this, but I can pose possibilities.

    Why would the P2 have some extended cog ram?
    * To provide additional program/data space to the cog that runs at full speed, with no latency, and without any hub slot mechanism.
    * This space can be used to write more complex drivers without the problems (speed, latency and maybe jitter) that hubexec would entail.
    * Extended cog ram could be used as stack space (either by software or with additional hardware instructions).

    Large programs would still need to use hub ram, but more code routines could be stored in cog ram and/or extended cog ram.
  • Baggers Posts: 3,019
    edited 2014-05-18 02:38
    It won't need this now anyway, Ray; Chip's come up with a plan to have hub-exec also :)
  • Cluso99 Posts: 18,069
    edited 2014-05-18 02:55
    It would still be faster and simpler, and IMHO is the way to go for some small amount of ram (2KB-8KB).
    It would also use somewhat less power - no hub access required.

    The GCC guys need to have a stack in hub ram. This will slow things a lot and there is probably no way around it. This additional cog ram could be used for that, and the save could check for overflow and spill to hub if required. It would be immensely faster.
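
    A minimal C model of that overflow policy, just to illustrate the idea (the sizes and the split are assumptions, and real GCC stack handling would of course look rather different):

    ```c
    #include <stdint.h>
    #include <stdio.h>

    #define EXT_STACK_LONGS 256u    /* assumed stack area in extended cog ram */
    #define HUB_STACK_LONGS 4096u   /* overflow area in shared hub ram        */

    static uint32_t ext_stack[EXT_STACK_LONGS];  /* stands in for extended cog ram */
    static uint32_t hub_stack[HUB_STACK_LONGS];  /* stands in for hub ram          */
    static unsigned sp = 0;

    static void push(uint32_t value)
    {
        if (sp < EXT_STACK_LONGS)
            ext_stack[sp] = value;                    /* fast path: no hub slot needed */
        else
            hub_stack[sp - EXT_STACK_LONGS] = value;  /* slow path: costs a hub access */
        sp++;
    }

    static uint32_t pop(void)
    {
        sp--;
        return (sp < EXT_STACK_LONGS) ? ext_stack[sp] : hub_stack[sp - EXT_STACK_LONGS];
    }

    int main(void)
    {
        for (uint32_t i = 0; i < 300; i++)
            push(i);
        printf("top of stack: %u\n", (unsigned)pop());   /* 299, served from the hub portion */
        return 0;
    }
    ```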
  • jmg Posts: 15,171
    edited 2014-05-18 03:27
    Cluso99 wrote: »
    It would still be faster and simpler, and IMHO is the way to go for some small amount of ram (2KB-8KB).

    A question that may arise is (e.g.) are 14 COGs going to be just as useful as 16?
    With the extra features being discussed, would 14 COGs+ be better than 16 COGs?