Why not more COG memory?

richaj45 · 2010-11-08 22:48

Hello:

I was reading about the Prop II and how much is being packed into the chip. It is pretty amazing. However i have come to the conclusion that the COGs can only do two thing well. A protocol engine to driver peripheral engines and language interpreters.

That is, all non assemble programs are written in some code that is interpreted. Like Spin,
Basic, Forth, etc. Even when the LMM is used there is still some kind of fiddling to know how to jump and call.

Now if the COG could access more than 512 longs maybe that could change. I do realize the current codding of instructions in the 32-bits does not allow for over 9-bits of access, 512 longs, sooooo what would be a good way, if i could make new silicon, to increase the COG address space?

The first idea is to make the instruction more than 32-bits wide and use the extra bits for source and destination address bit. Kinda redefining the length of an integer from 32-bit to maybe 40-bits. That would give 13-bits for source and destination addresses.

What other ways can people think of is my question.

cheers,
rich

jazzed · 2010-11-09 00:03

richaj45 wrote: »

What other ways can people think of is my question.

Let the source and destination fields allow for register indirect access.

Today, there are 2 modes for source fields. 1) immediate and 2) register.

1) Immediate (noted by #label as in mov dest,#label) allows for using 9 bit values in the source field. 2) Register allows for using values from the register specified by the 9 bit source value - the source points to the data with single indirection.

The new register indirect access mode would be similar to register access except double indirection would be used. So the source register would contain the address of the COG register to use for example.

This would be harder to program and cause all kinds of headaches. All the PASM library code would need to be ported because only 256 registers would be addressable. Immediate values become 8 bit, etc...

The question for me is: would you be willing to sacrifice common HUB memory to do this? One could potentially extend HUB out beyond the chip, but then the single chip solution option goes away.

--Steve

RossH · 2010-11-09 00:34

Jazzed's idea of register indirect addressing (at least for some opcodes) would be good, as it would also improve the Prop's ability to implement stacks and other complex data strucutures - something that now has to be done either by using a lot of instructions, or by using self-modifying code.

Another idea (which I have proposed before) that could be used in conjunction with this is to implement simple cog ram paging - i.e. allow the current 512 instructions to be paged out and a new 512 instructions paged in by using a simple instruction thjat updates a special "page" register. The special registers can appear overlaid on all pages, or only overlaid on page 0. This type of paging technique used to be fairly common - it requires some special coding (but not much) and is otherwise fairly simple to implement in both hardware and software.

Ross.

Batang · 2010-11-09 00:51

jazzed: "but then the single chip solution option goes away"

Considering that the prop needs to boot from an external eeprom it is not what you could call single chip solution.

I find it interesting that are a plethora of add on memory modules for the prop.

dMajo · 2010-11-09 01:00

I agree with RossHs second option.
The paged memory is the way to go. The special registers (IN/OUT(PIN),DIR) shall be overlaid on all the pages. There is no need to implement a new register address for the page and so waste a useful long address. It could be hidden. It's enough to implement the actually unused dest space of the JMP instruction
JMP <#>Page, <#>Address

Cluso99 · 2010-11-09 01:04

Batang: The reason we have a number of pcbs which add external memory to the Prop is because we have found additional ways to use the prop which were never intended by Chip (the designer).

Indirection will not work in the prop's current form (Prop 1 or II) because it requires yet another fetch (i.e. clock) which blows away determinism.

I would like to see at some stage banked memory on at least 1 cog. Maybe a Prop II.5 with 1 cog having a further 2KB (or more) by bank switching 1KB blocks (128 longs). I have suggested this could be "stolen" from an adjacent cog.

However, I do see that the Prop II will be so much faster, and that we will utilise other methods such as overlaying and LMM to overcome the restricted cog space. We do not yet understand how we can access the 512 bytes (?) of FIFO in each cog. Perhaps there are some smarts here to be exploited.

Batang · 2010-11-09 01:17

Hi Cluso99,

Would you be so kind to let me know what some of those additional ways are.

Gadgetman · 2010-11-09 03:13

Cluso99 wrote: »

I would like to see at some stage banked memory on at least 1 cog. Maybe a Prop II.5 with 1 cog having a further 2KB (or more) by bank switching 1KB blocks (128 longs). I have suggested this could be "stolen" from an adjacent cog.

No stealing!

Because if that happens, the adjacent COG must be 'shut off' or you risk overwriting one COG's program with another...

richaj45 · 2010-11-09 06:16

These are good ideas.
One of my thoughts was to build a smaller chip to replace the SX parts that are going out of business. In this scenario there would be might not be any hub memory buts just one or more cogs and each one would run very very fast.

I have noticed that the SX programmers have done amazing things with such limited resources on an SX but it is very fast chip. Just like a COG has limited resources but is also fast and is getting faster.

Also i was thinking of doing a COG on an FPGA, (my profession), for non-commercial use to see if it would make a great nano-controller.

Please give me a little more detail on how paging would work. Especially how does one page reach over to get variables on another page. Or maybe global variable are kept in a small global memory space. Like CLUT on the new prop that is used for global variables.

Any thought.
rich

Ale · 2010-11-09 06:23

dMajo:

JMP's dest space is used for CALL (to write the return address to the pseudo RET instruction).

The only way would be to reduce the flags/conditions field. Just 8 combinations of flags and removing the write result flag can get you 1024 positions. Or maybe registers are only allowed in the first/last 512 positions and the rest is accessed by 2 separate instructions (JMP/CALL). Extending the range as much as you want.

Cluso99 · 2010-11-09 09:39

Gadgetman wrote: »

No stealing!

Because if that happens, the adjacent COG must be 'shut off' or you risk overwriting one COG's program with another...

Obviously it would be the programmers responsibility to ensure this although perhaps when enabled, the cog could be locked out. The same opinion could be said to be true for hub ram and yet we survive. Just because a solution has a downside doesn't mean it shouldn't be considered/implemented providing the upside is worth it and IMHO it sure would be worth having a single super-cog.

richaj45 said...
Also i was thinking of doing a COG on an FPGA, (my profession), for non-commercial use to see if it would make a great nano-controller.

A few of us have played with this before. Once you get into it you will see how smart Chip is in how he has used the silicon. I have an Avnet Spartan 3 pcb (Xilinx) and used Verilog.

Batang: There are many... Emulations such as heater's ZiCog and ZPU and Pullmoll's emulations, RossH's C compiler can use XMM (external memory), Bill has caching programs, etc. These are all microcomputer not microcontroller solutions.

Electronegativity · 2010-11-09 11:19

In the "Chat with Chip" thread above there is mention of direct 32 bit pipelining between Cogs in the Prop2.
I'm not exactly sure what that means, but it seems like it would give Cogs access to each other's memory without having to go through the Hub.

Phil Pilgrim (PhiPi) · 2010-11-09 11:24

As I understand it, the "pipelining" is a communications channel which can be used in lieu of signaling with external pins. It's not a DMA scheme. The cog sharing its memory would have to have a server program to interpret requests and to either store or retrieve data. It would be quicker just to use the hub for something like that.

-Phil

markaeric · 2010-11-09 12:12

Speaking of "32-bit pipes between all cogs" I assume what chip meant was a 32-bit wide bus accessible to all cogs, rather than individual 32-bit pipes between all cogs (which was kind of what it sounded like)?

One way or another, this was one of the new pieces of info I was most excited about for some reason.

Cluso99 · 2010-11-09 18:08

The 32 bit pipe has been discussed on another thread.

IIRC, it is effectively a 32bit I/O register without the pins. There are some special instructions to access them. So it is not really a pipe.

Batang · 2010-11-09 21:00

Clusso99: "There are many... Emulations such as heater's ZiCog and ZPU and Pullmoll's emulations, RossH's C compiler can use XMM (external memory), Bill has caching programs, etc. These are all microcomputer not microcontroller solutions"

It seems to me that if you want to program the prop in C you need to add memory and that in turn would to a loss of IO - zero sum game.

Interesting..............

Dr_Acula · 2010-11-09 21:29

You can program in C without external memory. I am using Catalina and also writing an IDE to make it much simpler to use. Just an hour ago I added a checkbox where you can compile using internal memory, or compile using external memory.

There are advantages to both. For big programs, use external memory. But for a quick and simple test, an internal program downloads about three times faster because it is smaller.

It isn't necessarily zero sum adding external memory. You will use up pins, but add a latch for 50c and you can have 8 more output pins. Ditto input pins. Dollar per pin, logic chips are cheaper than propeller chips.

Batang · 2010-11-09 22:03

Dr_Acula, whilst I agree with that it would depend on whether you are building something as "one off" project or mass producing something.

In the case of mass production you would have to factor the cost of the additional parts, the increased size PCB and loading costs as well as the additional PCB layout time.

The only benefit being you can do it C.

It appears to me the best fit for coding the prop is spin/pasm.

RossH · 2010-11-09 22:44

Batang wrote: »

It appears to me the best fit for coding the prop is spin/pasm.

For PASM, this is the definitely case - like any micro, the native assembly language is always going to be the fastest and most flexible option. But usually also the hardest and most expensive way to program it. This is certainly true on the Propeller, where you need all sorts of tricks as soon as you get beyond 496 instructions (e.g. LMM, overlays, multiple cogs etc).

SPIN, is also a natural fit for the Propeller if you don't mind losing a heap of speed - and also don't mind coding quite a lot of your application in PASM as a consequence. SPIN was designed with the Propeller architecture in mind, and would always be a good choice - unless you ever plan to use any other micro, since SPIN unlikely ever to be ported to any another (except by the lunatics in this forum!)

C is a perfectly acceptable alternative to SPIN and PASM. It falls somewhere in between the two. C programs will be larger than SPIN (perhaps several times larger) but also many times faster. C will not as fast as PASM (it might be about 1/4 as fast) and may be slightly larger - but you are also not restricted to 496 instructions.

Best of all, if you already have your application software (or some part of it) written in C, or might one day want to port your application to another micro, you won't need to rewrite it.

Ross.

dMajo · 2010-11-10 01:30

Ale wrote: »

dMajo:

JMP's dest space is used for CALL (to write the return address to the pseudo RET instruction).

The only way would be to reduce the flags/conditions field. Just 8 combinations of flags and removing the write result flag can get you 1024 positions. Or maybe registers are only allowed in the first/last 512 positions and the rest is accessed by 2 separate instructions (JMP/CALL). Extending the range as much as you want.

From what I have seen (also before your post) both jmpret and call have their result flag set in opposition to the jmp.
I thought that founding on this info the dest value can be otherwise interpreted in the two scenarios (call,jmpret(RetInstAddr, <#>DestAddress) || jmp(<#>Page,<#>Address)

Pls someone correct me if I am wrong

evanh · 2010-11-11 20:33

Can't help but nit-pick a little here:

richaj45 wrote: »

... I have come to the conclusion that the COGs can only do two thing well. A protocol engine to driver peripheral engines and language interpreters.

That is, all non assemble programs are written in some code that is interpreted. Like Spin, Basic, Forth, etc. Even when the LMM is used there is still some kind of fiddling to know how to jump and call.

Well, LMM is not really in any sense interpreted. And there is a number of languages that use LMM as their model of execution so you are getting it's benefits thrown in just by using them.

Look at this way; Machine code is best for tight I/O and filter blocks. That has always been true on all platforms. Top performance is only needed for certain operations - which usually ends up being put in hardware on single core designs. LMM can perform the desired general computing function while calling on raw speed for critical parts only.

richaj45 wrote:

Now if the COG could access more than 512 longs maybe that could change. I do realize the current codding of instructions in the 32-bits does not allow for over 9-bits of access, 512 longs, sooooo what would be a good way, if i could make new silicon, to increase the COG address space?

Ignoring instruction formatting, there is the problem of physically connecting more RAM to each Cog. The RAM takes space and you are multiplying that by eight for the eight Cogs. As it is, the Prop2 Cog RAM blocks (quad ported) are probably twice the logical size of the Prop1 RAM blocks (single ported).

Chip has estimated a 7.4x logical size increase for each Cog already. Which is the reason why Hub RAM has shrunk to 128KB instead of the earlier predicting easily 256KB. Part of that growth will be the FIFO and CLUT that each Cog has now gained.

Likewise, and this is not directly your question, things like bank-switching and more shadow registers within the Cogs is a bit of a no go. Such things, just like existing Cog RAM, generally have to be wired to individual Cogs. It's not a global shared resource like the Hub.

Evan

richaj45 · 2010-11-12 19:56

Thanks for all your feedback.

I think i don't understand the LMM model well enough.
I get that instructions are read form hub memory an then executed
but what it they are a jump or call instruction? In that case how does the LMM know
were to go in the hub memory to get the next instruction?

Thanks again.

rich

RossH · 2010-11-12 20:05

richaj45 wrote: »

Thanks for all your feedback.

I think i don't understand the LMM model well enough.
I get that instructions are read form hub memory an then executed
but what it they are a jump or call instruction? In that case how does the LMM know
were to go in the hub memory to get the next instruction?

Thanks again.

rich

Hi rich,

As you've realized, you can't simply execute a normal PASM JMP instruction - instead, the LMM kernel has to keep an internal register as the Program Counter (PC) that points into Hub RAM. Then a jump equivalent is to simply update this register - e.g.

mov PC, #XXX    ' equivalent to JMP to XXX
   sub PC, #100    ' jump 100 locations back (e.g. in a loop)

To do a longer jump, you normally write a JUMP function, which expects the jump location to follow the instruction immediately:

jmp #JUMP           
   long @JUMP_ADDR  ' <-- this will never be executed as an instruction!

Similarly for function calls:

jmp #CALL
   long @CALL_ADDR  ' <-- the LMM Kernel must skip over this!

Mike Green · 2010-11-12 20:08

Rich,
The LMM "interpreter" can't execute jump or call instructions. These are interpreted in a sense. Relative jumps are done by using an ADD or SUB instruction to change the interpretive "program counter" (where the next instruction is fetched from hub RAM. These can be made conditional so that takes care of conditional jumps. Calls are done by using a subroutine call to a routine within the LMM "interpreter" that saves the interpretive "program counter" on a stack, either within the "interpreter"'s cog or in hub memory. The address of the next instruction is taken from the next long word in memory. Long jumps (greater than 127 instructions forward or backward) and returns are also done by subroutine call.

Bill Henning · 2010-11-13 07:05

There is a better/faster JMP..

rdlong pc,pc ' no need for a micro code routine! PC will already be pointing to the data long

Even better, as long as the top 9 bits of the LMM address are 0, it is treated like a NOP, so you can use all of the conditional prefixes on the rdlong for conditional jumps

RossH wrote: »
Hi rich,
jmp #JUMP           
   long @JUMP_ADDR  ' <-- this will never be executed as an instruction!
Similarly for function calls:

[/code]

Dave Hein · 2010-11-13 08:24

Keep the cog memory at 512 longs, but increase the size of the program counter. Reserve a hub RAM address to read and write the program counter. Long addresses greater than 511 would reference hub RAM. Long addresses greater than 32767 would reference external SDRAM. Add small hardware caches for hub RAM and external SDRAM access so that instructions can be prefetched to reduce stalling on linear code.

RossH · 2010-11-13 14:12

Bill Henning wrote: »

There is a better/faster JMP..

rdlong pc,pc ' no need for a micro code routine! PC will already be pointing to the data long

Even better, as long as the top 9 bits of the LMM address are 0, it is treated like a NOP, so you can use all of the conditional prefixes on the rdlong for conditional jumps

Hi Bill,

Yes, "rdlong pc, pc" is all the JUMP routine does for LMM - but for XMM it can't work the same way, so in Catalina I decided to leave the JUMP kernel entry point in place for both - at the cost of an extra few clock cycles per LMM jump.

Limiting code addresses to 23 bits (top 9 bits zero) is clever, but it leaves you with a code space of only 8 Mb - and there are already XMM boards with more RAM than that available. I suppose you could support both types of jump, depending on whether the destination address fits into 23 bits or not. The trouble is you don't know this at compile time (only at assembly/link time).

However, both of these could easily be implemented in a post-compilation optimizer, so I'll have a look at them next time I'm working on the Catalina Optimizer.

Thanks,

Ross.

Bill Henning · 2010-11-13 14:25

You are absolutely correct, for XMM you do need a routine... can't believe I forgot.

FYI, you can keep the top 9 bits clear, AND get 32MB addressing - instructions must be long aligned, so just shift the 23 bit address left 2 bits

If you don't mind the address being more cryptic, just leave bits 23..25 as 0 (ZCR), and

jump
   rdlong addr,pc
   rol      addr,#8
   mov   pc,addr
   jmp    xmm_next

You can have 31 bit addressing!

How?

with ZCR=0, the instruction executes, but has no result and no effect on flags!

the ROL addr,#8 moves the top eight bits (two of which, ZC are 0) into the eight lowest bits of addr, thereby providing a save 31 bit address - allowing up to 2GB of address space - and the 'jmp #jump' can still have the conditional flags, as if the jump is not taken, the instruction will have no effect! (ok, hub instructions will be slower, but no effect!)

RossH wrote: »

Hi Bill,

Yes, "rdlong pc, pc" is all the JUMP routine does for LMM - but for XMM it can't work the same way, so in Catalina I decided to leave the JUMP kernel entry point in place for both - at the cost of an extra few clock cycles per LMM jump.

Limiting code addresses to 23 bits (top 9 bits zero) is clever, but it leaves you with a code space of only 8 Mb - and there are already XMM boards with more RAM than that available. I suppose you could support both types of jump, depending on whether the destination address fits into 23 bits or not. The trouble is you don't know this at compile time (only at assembly/link time).

However, both of these could easily be implemented in a post-compilation optimizer, so I'll have a look at them next time I'm working on the Catalina Optimizer.

Thanks,

Ross.

RossH · 2010-11-13 14:32

Bill Henning wrote: »
FYI, you can keep the top 9 bits clear, AND get 32MB addressing - instructions must be long aligned, so just shift the 23 bit address left 2 bits

If you don't mind the address being more cryptic, just leave bits 23..25 as 0 (ZCR), and
jump
   rdlong addr,pc
   rol      addr,#8
   mov   pc,addr
   jmp    xmm_next
You can have 31 bit addressing!

How?

with ZCR=0, the instruction executes, but has no result and no effect on flags!

the ROL addr,#8 moves the top eight bits (two of which, ZC are 0) into the eight lowest bits of addr, thereby providing a save 31 bit address - allowing up to 2GB of address space - and the 'jmp #jump' can still have the conditional flags, as if the jump is not taken, the instruction will have no effect! (ok, hub instructions will be slower, but no effect!)

Bill, were you born with a mind this devious, or did you have to work to achieve it?

Bill Henning · 2010-11-13 14:44

Thank you!

*blush*

Well, I studied Computer Science at Simon Fraser University, but frankly, I learned much more on my own ... I grew up "hacking" on Apple ]['s, Atari 400's, Amiga's etc... in assembly language. Being a computer scientist just gave me official creds and a piece of sheepskin

This technique only gives 29 bits of byte addressable memory, so only 512MB is byte addressable (ROL address, #6)

RossH wrote: »

Bill, were you born with a mind this devious, or did you have to work to achieve it?

kuroneko · 2010-11-13 16:41

Bill Henning wrote: »

You can have 31 bit addressing!

How?

with ZCR=0, the instruction executes, but has no result and no effect on flags!

Careful with that one. wrxxxx is encoded with R = 0 (rdxxxx uses R=1) and that may have drastic results.

Why not more COG memory?

Comments