An idea for p3(?) - 2-bank cog ram

pik33 · 2013-10-10 23:22

The idea is:

- make 2 banks of 512 32-bit words COG RAM
- add "bankinit" and "bankswitch" instruction

After coginit the cog starts with bank 0, then you can use bankinit 1,addr. It loads 512 words from the hub ram to the bank 1 in the background, without stopping a cog. Then you can use bankswitch 1, after which the next instruction will be executed from the bank 1. Then you can use bankinit 0, addr and bankswitch 0.

LMM code can be executed at near full speed with this. Priority logic will be needed for hub access, so cog executin bankinit have still access to the hub ram.

Heater. · 2013-10-10 23:55

Just say no to banked memory.

Remember all the fun we had with Intel 8086 64K segments? Remeber the joys of switching 64K blocks of RAM into high memory of your PC from memory expantion cards like the LIM standard http://en.wikipedia.org/wiki/Expanded_memory. The shere bliss of architecting large programs with overlaying linkers for such systems.

Or what about the ease of programming those little micros with banked memory like PIC and such.

You should be having cold sweats by now.

We already have that huge extra storage area, used to be called stack RAM I think it has been renamed since.

Switching memory banks will not play well with the multi-tasking on a COG.

Just say no to banked memory.

The proper way to do this for P3 is to make the COGs into 48 or 64 bit machines, which allows the 9 bit source and dest fields to be greatly widened and hence for huge potential COG register spaces. As much as will fit on whatever silicon process is available to the Prop III.

Sapieha · 2013-10-11 00:04

Hi pik33.

I posted that idea some Year ago --- Thinking it was good.

Now as I see how it is ---- I made us STACK memory --- And with that COG are more powerful that if it had second COG-space.

So I'm with Heater --- Just say no to banked memory.

pik33 wrote: »

The idea is:

- make 2 banks of 512 32-bit words COG RAM
- add "bankinit" and "bankswitch" instruction

After coginit the cog starts with bank 0, then you can use bankinit 1,addr. It loads 512 words from the hub ram to the bank 1 in the background, without stopping a cog. Then you can use bankswitch 1, after which the next instruction will be executed from the bank 1. Then you can use bankinit 0, addr and bankswitch 0.

LMM code can be executed at near full speed with this. Priority logic will be needed for hub access, so cog executin bankinit have still access to the hub ram.

pik33 · 2013-10-11 00:34

Heater. wrote: »

Remember all the fun we had with Intel 8086 64K segments? Remeber the joys of switching 64K blocks of RAM into high memory of your PC from memory expantion cards like the LIM standard http://en.wikipedia.org/wiki/Expanded_memory. The shere bliss of architecting large programs with overlaying linkers for such systems.

Yes, I remember. I programmed these real mode x86 processors in asm... Good old times... and a lot of joy... Then all joy disappeared with these new 32-64 bit processors and overweighted operating systems... until I found the Propeller which can be still programmed in asm and you can still output pixels to the VGA or sound to the headphones in the way you (and not the OS) want.

...but this has nothing to do with an idea of 2 switched cog ram banks. The main idea was: how to get rid of 8000 clocks needed to coginit.

The main problem with lmm is: you have to load from the hub ram continuously and this needs time. The cog cannot do anything when it does rdlong.
The other solution for LMM is: coginit->execute-> coginit-> execute - this is even worse, with the propeller 1 you have to wait over 8000 clocks.

Having 2 banks you can "coginit" unused bank in the background and then instantly switch the bank.

Another use of this: when you have to have a lot of peripherals not to be used at the same time. Simply switch a bank and you have another peripheral ready to go without waiting for coginit.

Heater. · 2013-10-11 01:19

I don't see how a banking solution helps with LMM.

I'm not sure but I think on P2 it is possible to have a LMM loop executing "big" code from HUB and a task or two running in the same COG. Those tasks would make use of the time the LMM loop is waiting for HUB.

One would never do LMM with COGINIT, but Cluso has an overlay mechanism that pulls chunks of code into HUB and executes them. Load time depends on the overlay size. Can be faster than LMM for some code.

The C compiler has "fcache" which can pull small loops into COG for natve speed execution. Turns out the C language version of my FFT usinging fcache for the main loop runs nearly as fast as the hand crafted assembler version!

Having a lot of peripherals "sleeping" in banks is an iteresting idea. Not sure how many projects would require it. It suffers for that "banking syndrome" you can't use both at the same time and have to architect your program around that limitation.
The PII multi tasking gives us the ability to put two or more different peripherals into a COG easily (As long as they fit the space) with the benifit that they are all available at the same time.

Gadgetman · 2013-10-11 01:59

Just wondering...

Is the COG supposed to start executing the code in the second bank the same way it starts executing the code in the main bank, or?

I see that it has potential for faster switching from one program to another, but it also completely destroys the possibility of multitasking on that COG.

If the COG continues executing from the 'next PC address' instead of resetting, I guess it could be used as some sort of overlay for large subroutines, but frankly, it would be much cleaner to use the already existing LMM routines.

Dave Hein · 2013-10-11 06:05

Another way to handle larger cog RAM is increase the size of the program counter, and access cog locations beyond 511 using index registers. If the program counter were 10 bits wide instead of 9 programs could be 1024 instruction long. Instead of doing a jump as "jmp #label" You would do a jump as "jmp label_address, where the location at label_address would contain the address of "label". Code that is located within the first 512 cog locations could work as it does on P1, and code after the first 512 location would need to used the indirect jump.

If we add some index registers we could access data locations beyond 511. I would suggest using the same method for indexing that P2 will have. Most variables could be defined within the first 512 longs of memory so that they could be accessed like they currently are in P1. Additional variables, arrays and tables could be stored in the high end of cog memory, and accessed through the index registers.

Bill Henning · 2013-10-11 06:41

WC & WZ are not needed for any of the jump instructions, using them for two extra address bits would allow for 2048 cog locations (for jump destinations)

with 2048 cog locations, each of the four tasks could have 512 longs (minus SFR's)

Yanomani · 2013-10-11 07:00

pik33 wrote: »

... until I found the Propeller which can be still programmed in asm and you can still output pixels to the VGA or sound to the headphones in the way you (and not the OS) want.

...but this has nothing to do with an idea of 2 switched cog ram banks. The main idea was: how to get rid of 8000 clocks needed to coginit.

The main problem with lmm is: you have to load from the hub ram continuously and this needs time. The cog cannot do anything when it does rdlong.
The other solution for LMM is: coginit->execute-> coginit-> execute - this is even worse, with the propeller 1 you have to wait over 8000 clocks.

Having 2 banks you can "coginit" unused bank in the background and then instantly switch the bank.

Another use of this: when you have to have a lot of peripherals not to be used at the same time. Simply switch a bank and you have another peripheral ready to go without waiting for coginit.

pik33

Life becomes easier, with COGINIT now taking 1,016 clocks to load and start a Cog, and also that we have so many ways to sincronize the execution between any number of running Cogs.

Unless some whole Cog taking task, actually empties all its resources and executing options, e.g. a straight vector of almost 500 NOPS (plus a trailing COGINIT, to start a new set, when the actually running has totaly done) in far less than that 1,016 clock period, you always can have two Cogs, practically continuing one's action after the other, if you need do it that way.

In an almost eternally controlable flip-flop, or even a useful playground balance scheme.

As for the RDLONGs, we now have the QUADS, that seamlessly bring us four longs, at the same time, with far less overhead.
There are lots of useful changes and enjoyment for coders, even the seasoned ones, like us!

Yanomani

Cluso99 · 2013-10-11 07:03

Bill Henning wrote: »

WC & WZ are not needed for any of the jump instructions, using them for two extra address bits would allow for 2048 cog locations (for jump destinations)

with 2048 cog locations, each of the four tasks could have 512 longs (minus SFR's)

Excellent point Bill ! It's also possible to expand the clut to 512 longs. And we would all like more hub ram. That should keep the 32bit P3 alive.

Yanomani · 2013-10-11 07:07

I'm wondering....

If we can have a mechanism, similar to COGINIT, to automatically load the Stack Ram....
Even if it only works at Cog Startup, as a choice in the manner we code the COGINIT instruction, it can be very useful.

One could load its routines, and tables, in a single operation....

Yanomani

Bill Henning · 2013-10-11 07:12

Thanks... actually the clut could be much bigger, due to the pointers & indexed mode

Cluso99 wrote: »

Excellent point Bill ! It's also possible to expand the clut to 512 longs. And we would all like more hub ram. That should keep the 32bit P3 alive.

Yanomani · 2013-10-11 07:14

Bill Henning, Cluso99

Why don't having it all, at the same time?

Bill Henning
WC & WZ are not needed for any of the jump instructions, using them for two extra address bits would allow for 2048 cog locations (for jump destinations)

with 2048 cog locations, each of the four tasks could have 512 longs (minus SFR's

Cluso99
Excellent point Bill ! It's also possible to expand the clut to 512 longs. And we would all like more hub ram. That should keep the 32bit P3 alive.

Yanomani
If we can have a mechanism, similar to COGINIT, to automatically load the Stack Ram....
Even if it only works at Cog Startup, as a choice in the manner we code the COGINIT instruction, it can be very useful.

One could load its routines, and tables, in a single operation....

Yanomani

Bill Henning · 2013-10-11 07:26

Yanoman,

    setptra   hubaddress

    reps      #256,#2
    setspa    #0
    rdlongc   temp,ptra++
    pusha     temp

loads the CLUT in just over 512 clock cycles, as fast as hardware could load it - I use this snippet in a far bit of my code - a hardware loader would not be any faster.

Yanomani wrote: »

I'm wondering....

If we can have a mechanism, similar to COGINIT, to automatically load the Stack Ram....
Even if it only works at Cog Startup, as a choice in the manner we code the COGINIT instruction, it can be very useful.

One could load its routines, and tables, in a single operation....

Yanomani

Bill Henning · 2013-10-11 07:27

We can in P3 ... this is too big a change, too late, for P2

(and it would use a LOT more transistors)

Yanomani wrote: »

Bill Henning, Cluso99

Why don't having it all, at the same time?

Yanomani

Yanomani · 2013-10-11 11:23

Bill Henning wrote: »
Yanoman,
    setptra   hubaddress

    reps      #256,#2
    setspa    #0
    rdlongc   temp,ptra++
    pusha     temp
loads the CLUT in just over 512 clock cycles, as fast as hardware could load it - I use this snippet in a far bit of my code - a hardware loader would not be any faster.

Thanks Bill Henning

Really nice use of the QUADs, good lesson to me.

Only to enable me to test if I'm in sync, with the actual operations that P2 does:

Prior to entering you Stack Ram loader routine, a SETQUAD or SETQUAZ was used, if the quads were pointing to some executable code area, or any other register locations being used by the program that is running into the Cog.

* The following paragraph only applies, if a SETQUAD or SETQUAZ does not implies a simultaneous CACHEX, since its not described, nor denied on last version of Chip's documents.

Either way, to be sure of cache contents, also an earlier CACHEX instruction should have been executed, to invalidate any possible remaining data, resulting from any previously cached operations, that were not taken in multiples of four longs, or any equivalent count, to ensure that any previously cached data, has exhausted.

Yanomani

Yanomani · 2013-10-11 11:30

Chip,

Despite the last documents, on the subject of QUAD-retated instructions operation, doesn't describe it, could you confirm or deny, if a SETQUAD or SETQUAZ instruction execution, dos inherently implies a CACHEX-like operation too?

Yanomani

Yanomani · 2013-10-11 12:03

Bill Henning wrote: »

We can in P3 ... this is too big a change, too late, for P2

(and it would use a LOT more transistors)

Bill Henning

For a moment only, I'd believed that it was totaly possible, to create a derivative instruction from COGINIT, to divert Hub gathered data destination and transfer lenght.

I'm not (yet) asking for any reverse action, some kind of COGSTOP (or COGFLUSH), with simultaneous transfer of COG memory contents, back to HUB memory.

For P3, even some exchange-wise operations, between COG and HUB memory, can be devised, if they show as being useful to enhance total system performance.

All of this, is not just a matter of exhausting any available opcodes, left on the actual map. It's a matter of giving the maximum eficiency and enabling the creative ingenuity, of actual and future Propeller programmers.

As time passes, and targeted processes and costs enable it, bigger and faster data paths can be crafted, to diminish latency and data transfer time, between COGs and the HUB, and even at the inter-COG interface too.

Perhaps, if nothing irremediably crashes, inside anyone of us, we'll be there to enjoy, all those wonderfull caracteristics.

And code, code, code, code, code...

Yanomani

jmg · 2013-10-11 12:43

Bill Henning wrote: »

WC & WZ are not needed for any of the jump instructions, using them for two extra address bits would allow for 2048 cog locations (for jump destinations)

with 2048 cog locations, each of the four tasks could have 512 longs (minus SFR's)

Good idea, but is that enough of a increase ?

A problem with COG RAM, is it very costly in silicon, as it is multi-port.
Adding cheaper RAM, that can be Array or Table Storage or read-only code, can give you more RAM in the same area.
It gives you more working space, by moving stuff out of valuable memory, into cheaper memory.

If the process shrink allow shiploads of multi-port memory, then an idea already used in other register - register cores, is a register frame pointer.
This allows some granular placement of the opcode and index reach within the larger total space, and has advantages over banking.

Using this, you can partially overlay areas, to allow transparent passing of parameters between threads.
The simpler Indexed reads, would be able to reach the whole space, multi-port RMW would be within an opcode-reach

cgracey · 2013-10-11 15:04

Yanomani wrote: »

Chip,

Despite the last documents, on the subject of QUAD-retated instructions operation, doesn't describe it, could you confirm or deny, if a SETQUAD or SETQUAZ instruction execution, dos inherently implies a CACHEX-like operation too?

Yanomani

No, only CACHEX causes the cache to be invalidated.

Yanomani · 2013-10-11 15:20

cgracey wrote: »

No, only CACHEX causes the cache to be invalidated.

Thanks Chip

This seems enough to me, for the moment.

Yanomani

P.S. Forget about the repeated question at the other thread. :thumb:

jazzed · 2013-10-11 15:32

IMHO (probably to be ignored LOL),

It would be far better to make it possible for all COGS to be able to read any global RAM location simultaneously and allow writing via round robin or strict priority dependent on the user's needs.

I know it is difficult, but it is worth investigating further.

pjv · 2013-10-11 15:53

jazzed wrote: »

IMHO (probably to be ignored LOL),

It would be far better to make it possible for all COGS to be able to read any global RAM location simultaneously and allow writing via round robin or strict priority dependent on the user's needs.

I know it is difficult, but it is worth investigating further.

Yet another of those "out-of-the-box" ideas....... I just love these !

Cheers,

Peter (pjv)

Bill Henning · 2013-10-11 16:04

That would require eight-ported hub, nine if there was also a write port... so roughly 9x the number of transistors.

jazzed wrote: »

IMHO (probably to be ignored LOL),

It would be far better to make it possible for all COGS to be able to read any global RAM location simultaneously and allow writing via round robin or strict priority dependent on the user's needs.

I know it is difficult, but it is worth investigating further.

brucee · 2013-10-11 16:05

Actually Hub RAM is not multiport, it is time division multiplexed. Its access time is at 200 MHz and each clock a COG gets a read or write

Multiport RAM would be quite large, I don't know if I've ever seen one that had that many ports.

Does Hub RAM buffer writes? That way a COG wouldn't have to wait for the first write. Not a big saving, but it is something, the rule of thumb for memory is that there are typically 3 or 4 reads for every write.

Seairth · 2013-10-11 16:20

Bill Henning wrote: »

That would require eight-ported hub, nine if there was also a write port... so roughly 9x the number of transistors.

Could the the hub be internally clocked at 8x the system clock, such that each cog (still in the round-robin order) could access the hub every regular clock cycle (plus maybe a few cycles for setup/copy)? Or maybe allow the operations to be asynchronous? In that case, the actual hub access would still be every 8 cycles, but the COG pipeline wouldn't stall.

kwinn · 2013-10-11 17:13

brucee wrote: »

Actually Hub RAM is not multiport, it is time division multiplexed. Its access time is at 200 MHz and each clock a COG gets a read or write

Multiport RAM would be quite large, I don't know if I've ever seen one that had that many ports.

Does Hub RAM buffer writes? That way a COG wouldn't have to wait for the first write. Not a big saving, but it is something, the rule of thumb for memory is that there are typically 3 or 4 reads for every write.

Did I miss something in one of the P2 threads? I was under the impression that HUB RAM was time division multiplexed between the eight cogs and that COG RAM would not be multiported since it serves only the one cog it is connected to.

jmg · 2013-10-11 17:28

kwinn wrote: »

COG RAM would not be multiported since it serves only the one cog it is connected to.

COG ram is multiported in the sense that any opcode can fetch from PC address, read one register and write another, in one clock. The user may think of it as just COG RAM, but in silicon, it needs all those ports to work.

cgracey · 2013-10-11 18:21

Seairth wrote: »

Could the the hub be internally clocked at 8x the system clock, such that each cog (still in the round-robin order) could access the hub every regular clock cycle (plus maybe a few cycles for setup/copy)? Or maybe allow the operations to be asynchronous? In that case, the actual hub access would still be every 8 cycles, but the COG pipeline wouldn't stall.

Reading the main SRAM takes as much time as a cog clock cycle, so we cannot speed it up even 2x.

evanh · 2013-10-11 19:52

Ah, the constant tugging of local vs global. Globals are so easy conceptually ... but they must be shared, and sharing cleanly makes things complicated. Bussing, point-to-point, muxing, buffering, caching, mutexes, signals, messages, objects.

kwinn · 2013-10-11 20:26

jazzed wrote: »

IMHO (probably to be ignored LOL),

It would be far better to make it possible for all COGS to be able to read any global RAM location simultaneously and allow writing via round robin or strict priority dependent on the user's needs.

I know it is difficult, but it is worth investigating further.

Not only would this be difficult to do and add a lot of transistors, it would create a lot of software problems. How do you control what cogs can read or write to a memory location if they are using it to communicate?

An idea for p3(?) - 2-bank cog ram

Comments