The case for Additional/Extended COG RAM (+2/4/6/8KB)

jmg · 2014-05-22 14:24

Cluso99 wrote: »

4. Do you believe all cogs must be equal in terms of cog ram?
I think some cogs need more memory so they can do the grunt work.

Exploring this makes sense, but it depends first on getting HubExec, and some idea of total Memory Impacts of the newest respin.
The obvious first step is to add code execution from LUT, as that has no memory trade-offs: it makes better use of what is already there. Chip has indicated he will look at this.

Note with the newest hardware, you can stream faster to the HUB than you can to a COG, and large HUB is very flexible memory.

evanh · 2014-05-22 14:57

Heater. wrote: »

Did we really get as low as 128K in the old P2? That is really not good enough,

Yup, that first shuttle run was with only 128k hubRAM. I certainly wasn't happy at the time, given we'd already given up on having 16 cogs earlier. 256k hubRAM only came when Chip stripped out the global DAC bus around the same time as threading was added.

Only as hubexec came into focus and Chip had to keep stuffing in more caching and muxing, the quad-wide hub bus, and special instructions did things become difficult for him.

Belatedly, and somewhat independently, the thermal issue caused him to rethink the whole approach.

Roy Eltham · 2014-05-22 15:12

My projects vary all over the place. On the P1, I mostly use PASM/Spin, but I have been working on a project that is completely in C/C++ using propgcc/simpleIDE (it uses cogc for several drivers, and then the main code in C/C++). That all C/C++ project is using an external SPI ram and XMMC mode to be able to fit on the P1 at all.

For the P2, I imagine I will have some of each as well, but probably more in the C/C++ side since it will handle it better without external memory. I can easily fill up 512K with C/C++ code/data for any number of projects (pinball machine, medium sized robot with whole house mapping & intelligent navigation). With hubexec, that code will run fast, and I can use multiple cogs running the same code in parallel (if it's in HUB memory) operating on different parts of the data.

I was never happy with the 128K hub on the early P2 design. It became 256K after they dropped the full cog-to-pins bus that allowed all cogs to talk to all pins for DACs (aka the 9-bit data path to all pins). 256K was OK but still not enough; it was alleviated by the built-in SDRAM support and enough pins to have a large SDRAM and still have plenty of I/Os. 512K is good, but I would still prefer even more HUB ram. I look forward to a future chip someday that has multiple megabytes of shared hub memory... perhaps even having the HUB memory being extendable seamlessly with external memory!

For me, my most typical use case for the Prop architecture is to have several special hand written drivers spread around the available cogs, and then one cog running some main/master code that interacts with those drivers. Occasionally, I will use a few cogs working together on a single task (like capturing a frame from a camera module).

Yes, it would be nice if we could have more memory per cog, but I would want it to be uniform and all registers, allowing the same type of coding style throughout all of it (self-modifying, and all). The path I would explore would be to figure out if we can reduce the instruction set enough to free up 2 bits and get S and D to be 10 bits each (thus doubling cog ram size). Perhaps some other changes to the opcodes could work? I don't think it's feasible to expand cog memory to more bits (above 32) right now.
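The bit budget behind that suggestion can be sketched as follows. This is only an illustration, assuming the 32-bit instruction word and the 9-bit S and D fields implied by the post; the helper name `field_budget` is made up for this example.

```python
WORD_BITS = 32  # instruction width mentioned in the post


def field_budget(operand_bits):
    """Return (addressable registers, bits left over for everything else).

    Everything else means opcode, flags, condition and so on; the exact
    split of those bits is not modelled here.
    """
    registers = 1 << operand_bits               # each operand indexes this many longs
    remaining = WORD_BITS - 2 * operand_bits    # S and D each take operand_bits
    return registers, remaining


current = field_budget(9)    # assumed present layout: 9-bit S and D
proposed = field_budget(10)  # the suggestion: 10-bit S and D
print(current, proposed)
```

Widening both fields by one bit doubles the addressable registers, and the two bits have to come out of whatever the rest of the instruction word is used for.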

I really dislike the idea of a Propeller with some cogs being different from others (memory size or otherwise). It bums me out that we lost the uniform all-cogs access to the pins for DACs, but at least they can still do reasonably high DAC output to all pins using the MSGOUT instruction.

So anyway, the short version: COG memory size is pretty good for hand written drivers and kernels, and hubexec allows for fast large programs, so I don't think we should do weird hacks to get more cog memory that is non-uniform.

evanh · 2014-05-22 15:20

Roy Eltham wrote: »

512K is good, but I would still prefer even more HUB ram. I look forward to a future chip someday that has multiple megabytes of shared hub memory...

One word - MRAM

jmg · 2014-05-22 15:53

Roy Eltham wrote: »

512K is good, but I would still prefer even more HUB ram. I look forward to a future chip someday that has multiple megabytes of shared hub memory... perhaps even having the HUB memory being extendable seamlessly with external memory!

For those needing ever more code, what should be practical is QuadSPI (DDR?) Execute in Place (XIP) for the outermost tasks.
That allows the Prop to deliver very fast code via COGS, larger code via HubExec, and really large code, cheaply, via XIP.

MJB · 2014-05-23 07:20

36 bits? That sounds great. Can you implement the PDP-10 instruction set please? :-)

Symbolics LISP Machines also had a 36-bit word length, with the 4 extra bits used for the concurrent garbage collector ;-)
I recently scrapped my 3620 :-( with 8MB RAM, after the 160 MB HD crashed.

dMajo · 2014-05-23 09:49

I am ok with current cog/hub ram sizes.
If possible, it will be nice if the LUT can be reutilized for data storage/buffers (no execution of any kind) where it is not used for video/waveforms, if this comes relatively simply.

I think that the only new instruction needed is one to switch memory space, e.g. memswitch (0=cog | 1=lut), to redirect the source and destination (of all other instructions) to the cog registers or LUT memory. Most of the variables for various math operations, arrays, and buffers can be in LUT, and for local self-modifying code the cog registers can be used. Perhaps memsw_s and memsw_d can be used to differentiate the two and allow moves between cog space and LUT.

The LUT could in this way be overlaid on the higher 256 cog registers (or the lower, it's the same to me), so the programmer can keep separate variables in the lower (or higher) ones if he needs pointers to the LUT space or (cog) register space that can be seen from everywhere.

Simple, if it comes out OK; otherwise the LUT ram will be wasted when not used for video ... still OK .... I know many will dislike this.

Roy Eltham · 2014-05-23 10:44

dMajo,
You can't redirect all S and D accesses to LUT memory, because S and D are read at the same time via dual port access. In order to use LUT as data storage, we'd need new instructions to move data between LUT and COG memory, or you could possibly redirect one of S or D to be from LUT memory, but that's 2 bits that need to go someplace. Maybe with an ALTS/ALTD-like setup like you suggested?

Executing LUT memory would be more possible, because the internal PC register can be made bigger and the instruction lookup read can be redirected based on the top bit.
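A minimal model of that idea, assuming a 10-bit PC, 512 longs of cog RAM and a 256-long LUT (sizes taken from elsewhere in the thread, not from any final design). The function name `fetch` and the aliasing behaviour are assumptions of this sketch.

```python
COG_LONGS = 512  # assumed cog register RAM size, in longs
LUT_LONGS = 256  # assumed LUT size, in longs


def fetch(pc, cog_ram, lut_ram):
    """Fetch the instruction at a 10-bit PC; the top bit picks the memory."""
    if pc & 0x200:
        # Top bit set: read from the LUT. With only 256 longs behind it,
        # $200-$2FF and $300-$3FF alias to the same locations in this model.
        return lut_ram[pc & (LUT_LONGS - 1)]
    # Top bit clear: ordinary fetch from cog RAM.
    return cog_ram[pc & (COG_LONGS - 1)]
```

Nothing in the S and D paths changes here; only the instruction read is steered, which is why this looks simpler than redirecting operand accesses.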

Bill Henning · 2014-05-23 11:02

I like the idea of being able to use the LUT for extra code.

A "RDLUT D,#/S" instruction would be very nice for look up tables

and "WRLUT D,#/S" for initializing it.

Nice and simple, no fancy indexing needed (although that would be nice <grin>)

We would need something like RD/WR lut for setting up and manipulating the LUT anyway, and executing from it would be a nice (hopefully easy) bonus.

Roy Eltham wrote: »

dMajo,
You can't redirect all S and D accesses to LUT memory, because S and D are read at the same time via dual port access. In order to use LUT as data storage, we'd need new instructions to move data between LUT and COG memory, or you could possibly redirect one of S or D to be from LUT memory, but that's 2 bits that need to go someplace. Maybe with an ALTS/ALTD-like setup like you suggested?

Executing LUT memory would be more possible, because the internal PC register can be made bigger and the instruction lookup read can be redirected based on the top bit.

dMajo · 2014-05-23 11:08

Thanks Roy,
I didn't consider the simultaneous access of S and D.

Then it is perhaps better to just execute-only from it, only because it's already there and only if it is simple.... and keep the rest of the memories as they are now.
I mean this is not something the doctor has ordered you to use; it's simply available for those who find it useful. No need to do it at all, it's only so as not to leave 14 LUTs out of 16 unused.... even if I would have preferred it to allow data usage (without additional opcodes and extension/overcomplication of the current instruction set).

Cluso99 · 2014-05-23 16:34

I am not sure of the implementation ramifications, but RDLONG/WRLONG could be used to access LUT (additional cog ram) when the hub address is <$300, presuming LUT is $300-3FF.

Bill Henning · 2014-05-23 16:43

True, I guess it will depend on what is simpler/faster for chip ... separate instructions, or more address decoding.

Mind you, it would be easier to program with separate instructions, as otherwise $300 would have to be added to the LUT index before use (AUGS or an add)
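The two options can be compared with a small model. It is a sketch only: it assumes the LUT window sits at $300-$3FF as presumed in the thread, treats addresses as plain element indices, and uses the made-up helper names `rdlong_mapped` and `rdlut` (the latter standing in for the RDLUT idea above).

```python
LUT_BASE = 0x300  # presumed LUT window start, from the thread
LUT_SIZE = 0x100  # presumed LUT length: $300-$3FF


def rdlong_mapped(addr, hub, lut):
    """Address-decoded variant: one read instruction, target chosen by address."""
    if LUT_BASE <= addr < LUT_BASE + LUT_SIZE:
        # The window is aligned to its size, so the low bits are the LUT index.
        return lut[addr & (LUT_SIZE - 1)]
    return hub[addr]


def rdlut(index, lut):
    """Separate-opcode variant: the index is used directly, no base to add."""
    return lut[index & (LUT_SIZE - 1)]


hub = list(range(0x1000))
lut = [0xA000 + i for i in range(LUT_SIZE)]
i = 5
via_address = rdlong_mapped(LUT_BASE + i, hub, lut)  # caller must add $300 first
via_opcode = rdlut(i, lut)                           # caller passes the raw index
```

Both calls return the same long; the difference is the extra add (or AUGS) on the caller's side in the address-decoded case.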

Cluso99 wrote: »

I am not sure of the implementation ramifications, but RDLONG/WRLONG could be used to access LUT (additional cog ram) when the hub address is <$300, presuming LUT is $300-3FF.

jmg · 2014-05-23 16:57

Bill Henning wrote: »

True, I guess it will depend on what is simpler/faster for chip ... separate instructions, or more address decoding.

Mind you, it would be easier to program with separate instructions, as otherwise $300 would have to be added to the LUT index before use (AUGS or an add)

Plus this creates 'dead' HUB memory, unless you add an address bit, and extra address bits slow things down, so that favours an opcode-decision over an address one.

evanh · 2014-05-23 18:35

jmg wrote: »

Plus this creates 'dead' HUB memory, ...

Chip might be able to answer that more precisely. As in cutting out a chunk of hubRAM might save some space. The last time this issue was discussed Chip was still using a fixed block hubRAM that he didn't want to resynthesise or something to that effect.

jmg · 2014-05-23 18:39

evanh wrote: »

Chip might be able to answer that more precisely. As in cutting out a chunk of hubRAM might save some space. The last time this issue was discussed Chip was still using a fixed block hubRAM that he didn't want to resynthesise or something to that effect.

I think it has changed to Synthesised Blocks of Memory - the fixed designs are gone.
Those synthesis RAM-generator macros are probably going to expect conventional sizes (i.e. 2^N).

Bill Henning · 2014-05-23 19:26

Hmm... does that mean that there will have to be some rom?

jmg wrote: »

I think it has changed to Synthesised Blocks of Memory - the fixed designs are gone.
Those synthesis RAM-generator macros are probably going to expect conventional sizes (i.e. 2^N).

jmg · 2014-05-23 20:10

Bill Henning wrote: »

Hmm... does that mean that there will have to be some rom?

Good question. From a synthesis handling POV, something like a serial ROM could be useful.
- unless the OnSemi automated tools are clever enough to do what Chip was doing, hard-bridging a few RAM cells as ROM.

kwinn · 2014-05-23 22:22

jmg wrote: »

Plus this creates 'dead' HUB memory, unless you add an address bit, and extra address bits slow things down, so that favours an opcode-decision over an address one.

Why would it create "dead" hub memory? Unless there is some specific limitation I am not aware of there is no reason hub memory has to start at location 0. The address decoding logic can start at any arbitrary address.

jmg · 2014-05-23 23:31

kwinn wrote: »

Why would it create "dead" hub memory? Unless there is some specific limitation I am not aware of there is no reason hub memory has to start at location 0. The address decoding logic can start at any arbitrary address.

Memory decoding is a sea of binary multiplexers. If you want to offset those binary trees, then you need an adder in the memory address path, which would be a very bad idea, as these need to meet 5ns access times.
Chip has those delays finely tuned, and I doubt he would want to add anything in there at all.

evanh · 2014-05-24 01:19

There's no offsetting needed, just have no RAMs below a certain address. The first 12k addresses, for example, wouldn't hit any hubRAM.

ROM could go there and/or Cog space, and even Cluso's extended Cog space would fit well.

EDIT: The physical distribution of the 16 RAM blocks means that, for this example, the first 768 bytes of each of the 16 blocks would have to be what gets deleted.
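The per-block figure follows from simple division, assuming the 12k refers to byte addresses and that they are spread evenly across the 16 blocks:

```python
RESERVED_BYTES = 12 * 1024  # "the first 12k addresses"
RAM_BLOCKS = 16             # "the 16 RAM blocks"

# Bytes that would have to be removed from each block if the reserved
# range is distributed evenly across all of them.
per_block = RESERVED_BYTES // RAM_BLOCKS
print(per_block)  # 768
```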

Cluso99 · 2014-05-24 02:44

The older P2 Hub addresses $00000-~0DFFF were used as the boot/secure/monitor ROM. The last P2 IIRC only used to ~$00800. So the cog, LUT, etc can be mapped over the ROM space without problems.
This was the method used for JMP/CALL/RET in hubexec mode on the last P2, so there are no problems in doing so. It therefore comes down to what is easiest for Chip.

evanh · 2014-05-24 03:36

Cluso99 wrote: »

The older P2 Hub addresses $00000-~0DFFF were used as the boot/secure/monitor ROM. The last P2 IIRC only used to ~$00800. So the cog, LUT, etc can be mapped over the ROM space without problems.

Yeah, what we're pondering is that the ROM bits now won't have to be fitted to the RAM cell dimensions.

Cluso99 · 2014-05-24 06:05

evanh,
Yes, I didn't think about the ROM ramifications. I would expect the standard RAM cells cannot work like Chip/Beau did previously.

By not having real ROM, it saved the bus going to both RAM and ROM blocks.

Maybe there is a way to load some ROM directly into Cog 0's ram on power up, making a simpler interface silicon wise.

Guess we will have to wait for a comment from Chip.

evanh · 2014-05-24 13:36

Such a ROM can, functionally at least, still be fitted alongside the SRAM cells using the same muxing. It's just that, unlike the shuttle run hack, it won't be as big.

kwinn · 2014-05-24 18:53

jmg wrote: »

Memory decoding is a sea of binary multiplexers. If you want to offset those binary trees, then you need an adder in the memory address path, which would be a very bad idea, as these need to meet 5ns access times.
Chip has those delays finely tuned, and I doubt he would want to add anything in there at all.

Not quite correct based on the chip data sheets I have seen. Memory selection is typically done with nand gates to select rows and columns of an array of bits. The first row/column nand inputs are normally connected to the address bits so that address $0 will select row 0 column 0 of the memory array, and the next nand is connected to select the next row/column, and so on.

They do not have to be connected like that. The first row nand could be connected so that address 1, or 4, or any other address selects row 0. That allows hub memory (rom or ram) to start at any arbitrary address. This would not need an adder to offset the address, and would not affect speed since the address decoding for cog and hub would be independently done in the cog or hub.

Of course if hub memory does not start at address 0 it will need another address bit to access a full 512KB. The alternative would be to have hub ram be a bit less than 512KB. Might even have to be to make up for the extra space the added cog ram will need.
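The address-bit trade-off in that last paragraph can be checked numerically. This is a sketch under stated assumptions: byte addressing, a 512KB hub, and a 12KB starting offset borrowed from evanh's example purely for illustration; `address_bits` is a made-up helper.

```python
HUB_BYTES = 512 * 1024  # full hub size discussed in the thread
OFFSET = 12 * 1024      # illustrative start address, borrowed from evanh's example


def address_bits(base, size):
    """Bits needed to express the highest byte address of a block at `base`."""
    return (base + size - 1).bit_length()


full_at_zero = address_bits(0, HUB_BYTES)                 # hub starts at 0
full_offset = address_bits(OFFSET, HUB_BYTES)             # full 512KB, shifted up
trimmed_offset = address_bits(OFFSET, HUB_BYTES - OFFSET) # hub shortened to fit
print(full_at_zero, full_offset, trimmed_offset)
```

Shifting a full 512KB upward pushes the top address past the 19-bit range, while trimming the hub by the offset keeps it within the same number of bits.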
