No. There is a 32-bit XCH (exchange) system that can route 32 bits per clock between/among any/all cogs.
I must have missed this one, or thought it used port D. Another brilliant feature that will not only be faster, but avoids hub slots, and saves port D too.
I am not at all worried about the number of instructions. Many will only be used for special cases. What will make this easy is lots of sectionalised lists of instructions (such as a list of bit(s) manipulation instructions, etc.)
I agree that categorized lists will be very useful. I'm tired of manuals these days that just give you an alphabetical list of "things" that include instructions, condition codes, pseudo-ops, and other random things all sorted together. Makes it almost impossible to find something unless you know its name.
I've been thinking about how a cog has 512 registers.
I found early on in Prop1 development that 256 registers just wasn't enough room to write applications in, so I doubled the memory to 512 longs and that proved adequate for everything I wanted to do.
Now that we have hub execution, most programs will probably live outside the cog, with only some high-speed code needing to be in the cog registers, while the rest of the registers are variables.
Is there any compelling reason to go back to 256 registers, instead of 512? I'm just wondering. This would shrink the silicon area needed for those space-hungry 4-port RAMs and would free up two instruction bits, with D and S fields falling back to a more natural 8 bits, instead of 9. I don't think I'm going to make such a change, but I thought I'd see if anyone had any interesting ideas about this.
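As a rough illustration of the bit budget in question, here is a sketch that packs an instruction word using the Propeller 1 field layout (6-bit opcode, 4 effect bits, 4 condition bits, 9-bit D, 9-bit S) and shows that narrowing D and S to 8 bits leaves two bits spare. The P1 layout is used only as a reference point; the function and constant names are made up for this example, and the Prop2 encoding under discussion may well differ.

```python
# Field widths, most significant field first. The 9-bit D/S layout follows
# the Propeller 1 instruction format; the 8-bit variant is hypothetical.
P1_FIELDS = [("opcode", 6), ("zcri", 4), ("cond", 4), ("d", 9), ("s", 9)]
NARROW_FIELDS = [("opcode", 6), ("zcri", 4), ("cond", 4), ("d", 8), ("s", 8)]


def spare_bits(fields, word_size=32):
    """Return how many bits of the instruction word the fields leave unused."""
    return word_size - sum(width for _, width in fields)


def pack(fields, values):
    """Pack named field values into one integer, first field in the top bits."""
    word = 0
    for name, width in fields:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(fields, word):
    """Split an integer back into its named fields."""
    values = {}
    for name, width in reversed(fields):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values
```

With the 9-bit layout `spare_bits` comes out to zero, and with 8-bit D and S it comes out to two, which are the two instruction bits the question is about. The cost shows up in `pack`: a register address above 255 no longer fits.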
My mistake!!! It DOES use Port D. I'm getting snow blind.
What might make sense with P2 is comms between cogs where we use hub FIFOs (FullDuplexSerial, etc.): we also set a bit in port D to indicate a byte is available, and the cog does a waitpeq on this rather than using a hub slot in a tight loop, freeing the slot.
Only time will tell. Clearly, neither of these were available with P1.
BUT is it possible to make the COG's space addressable as 2 separate blocks of 256 longs?
And in HUBEXEC mode, reuse one of the 256-long halves of the COG for the registers you NEED?
While I look forward to the new world of "HUB exec", I still see a need for 512 long cog ram code. This keeps the P1 concept intact for users making the transition from P1 to P2. For PURE performance nothing beats good old tight PASM code for speed.
On the subject of the P2 manual/app notes, I put my hand up to volunteer contributing to such a document. When we have FPGA and silicon to play with, I think you will see a flurry of useful code and notes burst onto the forum. I know I have a few things on the go and I know I am not alone here.
"Build it and the docs will come"
Edit: 512 longs is still very useful for multi-tasking apps which in some cases have power saving possibilities.
Sounds like an exercise for future in-depth FPGA investigation ... with a Prop3 in mind.
Interesting.
If there was no Prop1, and no threads, that might make sense.
The presence of P1 rather dictates a minimum of 512, and threads mean users will tend to pack more into one COG, so the extra is not unused nearly as frequently as it is in P1.
A hybrid solution could be to make some of the 512, 2 port memory, thus shrinking the die size but keeping the total 512.
Code fetch and Load/store work in the 2 port memory, but not the full dual-operand opcodes.
- but I'd keep that sort of change for drastic cases like only if the die fails to fit, as it makes users take more care in the memory map, and would make porting P1 code harder.
Is there any compelling reason to go back to 256 registers, instead of 512? I'm just wondering.
I can think of several things worth doing without HUBEXEC that would benefit from 512, particularly when both the algorithm and data need room in the cog.
Please, if you can manage it, don't shrink the COG register space down to 256. Having more than 256 COG LONGs is still going to be very useful in cases where you might want to simply port over existing P1 COG code that already needs more room to run, or where you need fast byte-addressable lookup tables plus some extra COG space for other data/instructions or another thread's code, and you can't spare the AUX RAM for this because you need it for other purposes such as a call stack or video buffer. Having 512 longs on P1 (or really 496) was always handy in 256-entry LUT cases, even though you sometimes needed to be aware of the unusual 9-bit D/S field encodings in the instructions.
Is there any compelling reason to go back to 256 registers, instead of 512? I'm just wondering. This would shrink the silicon area needed ....
I suspect that routing is a problem now because of % usage.
I like Sapieha's idea of choosing how half of COG memory is used based on HUBEXEC mode and COG mode. That should remove the need for memory we wouldn't otherwise see anyway, will allow COG memory to stay 512 longs, and should allow for a bigger cache in HUBEXEC too (very desirable). That's a win-win-win situation.
One of the outcomes of slot sharing, if we get it, is likely to be pressure to not poll the hub in a tight loop unless really needed.
We will likely want to come up with some good non-hub methods of signalling between cogs, so data only needs to be read from the hub when it is ready instead of being constantly polled using hub reads.
One approach would be to use PortD for signalling. When one COG sets a particular bit, another COG can efficiently perform a HUBOP.
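To make the trade-off concrete, here is a small model of the two approaches: a consumer that reads a hub mailbox on every loop pass, versus one that watches a flag bit on a shared port and touches the hub only once the bit is set. It is a counting sketch, not PASM; the class and function names are invented for the example, and the port is modelled as a plain integer rather than the real Port D hardware.

```python
class Hub:
    """Shared memory that counts how many times it is read."""

    def __init__(self):
        self.mailbox = None
        self.reads = 0

    def read(self):
        self.reads += 1
        return self.mailbox


READY_BIT = 1 << 0  # hypothetical flag bit on the shared port


def consume_by_polling(hub, ready_at):
    """Read the hub mailbox on every loop pass until data shows up."""
    tick = 0
    while True:
        if tick == ready_at:
            hub.mailbox = 0x55          # producer writes at this tick
        value = hub.read()
        if value is not None:
            return value
        tick += 1


def consume_by_flag(hub, ready_at):
    """Watch the port bit (no hub access) and read the hub once it is set."""
    port = 0
    tick = 0
    while True:
        if tick == ready_at:
            hub.mailbox = 0x55          # producer writes the data...
            port |= READY_BIT           # ...then raises the flag bit
        if port & READY_BIT:
            return hub.read()
        tick += 1
```

In this model, with the producer ready at tick 10, the polling consumer performs 11 hub reads while the flag-driven consumer performs one. The hub reads it avoids are the slots that would be freed for other cogs.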
Ugh, 256 longs would make a ton of existing code impossible on the P2. Just because *1* thread can run full speed when the stars are aligned, doesn't mean we all want that handicap.
There is just so much code that requires 512 longs to be able to run, period. When you absolutely must have deterministic execution, in-COG is the only solution.
LOL! I have a draft blog entry suggesting this very thing! Here are some additional thoughts (from that entry, now that I won't be posting it):
This makes sense because it's likely that the only code that will still be running in cog mode will be self-modifying instructions and small, timing-critical code blocks. And with the ease of switching between modes, code organized this way will be fairly straightforward.
If, for now, you keep 9-bit addressing, you can still use the upper 256 addresses to access special-purpose registers (INDx, DIRx, etc). No more "shadow" registers! In fact, you could even revisit PINx (just kidding!). Or spread PortD to 8 separate 32-bit ports where each cog has one output port and seven input ports wired to the other cogs (not kidding!). Anyhow, you get the idea.
Several responses to the 256 vs 512 registers question suggested this would harm tasks. I thought it was going to be possible for all of the tasks to be in hub-mode at the same time. If that's the case, I don't see the problem.
Also, for those commenting about writing deterministic code, how often does the entire cog code need to be deterministic, as opposed to just specific segments/routines? I'm not saying the issue is being exaggerated, I'm just asking if it is.
I don't think we're going to shrink it. I think I'd feel marooned, myself, programming with half the registers we have now.
I was kind of wondering if anyone could come up with something really compelling to do with those two extra instruction bits.
Well, see my other post. But, if you were to go to 8-bit addresses, one thing you could do is replace AUGI with an encoding of up to three additional "operand" registers following an instruction, each of which would be the full 32-bit value instead of 25 bits. These would always be in the first three stages of the pipeline. And with it, you could support all sorts of instructions, which I'm sure this group could easily come up with. Or you could easily inline constants.
And another thought. You could use the bits to enable relative addressing for D and S.
I don't see anything compelling enough to warrant shrinking the COG. We've got a lot of features to support big requirements now. Having the COG run like it does in P1 allows for a simple, very high performance programming model.
That is important for those things where the P2 isn't necessarily running from external RAM and where it's performing a lot of different functions. As a use case, mostly COG code represents one extreme of real-time + performance capability, where the other extreme may be more tasks, some threads of compiled code, etc... and even that may well benefit from a roomy COG, depending on how we end up using the register space on big programs.
I would hate to change the COG size before we've seen the other powerful features play out.
Another need for cog memory is textures! Now with hubexec, we can use pretty much all of it for textures instead of mixed code and textures! Textured rendering is going to look even better!
A big part of the instructions are SET and GET instructions that access hardware-registers
OK, so instead of millions of hardware configuration registers, like a typical MCU, we have millions of different opcodes.
Maybe we could alias them to "mov" in the assembler.
...execution modes add a lot of complexity, mainly hardware tasks (but that's your fault)
Guilty as charged.
jmg,
Some might see 7, I would see that as only 3 - and that is not uncommon on small Micros.
1) COG/HUB
2) Single/Multi threaded
3) Stack in HUB/AUX
I'm curious, What small micros have those three options? Closest I can come up with is XMOS who have hardware thread scheduling (which cannot be turned off).
mindrobots,
The manual will be a BEAST...
We are looking at 200 pages just to describe all the instructions in Propeller manual style. Another hundred or so to explain how it all fits together!
You are right Chip we don't need to waste space on all that COG memory.
Now that we can execute from HUB directly we don't need all those COG locations. We only need a few registers to keep compilers happy, an accumulator register and some others for scratch space. We could call them AX, BX, CX and DX for simplicity. In case we want to work with 16 bit words or bytes we could access half registers, something like AH for high byte, AL for low byte and so on.
We will need a register for the stack pointer, call it SP, and to make life easy using local variables in functions we could have a stack frame pointer, call it BP (base pointer).
For string moves and other indexed addressing a couple of index registers would be useful, say SI and DI (src/dst index).
For computed jumps and such it's good to have access to the program counter, call it IP (instruction pointer).
One day we will want more than 256K of HUB so addressing that will be a problem. Better include some memory page or "segment" address registers. Perhaps DS, CS, SS for "data segment", "code segment" and "stack segment". Better throw another segment in just in case we need it: ES for "extra segment".
So we only need 13 COG registers: AX, BX, CX, DX, SI, DI, IP, SP, BP, DS, CS, SS, ES. That should be enough that the P2 architecture is useful for the next 30 years or so.
jmg,
I'm curious, What small micros have those three options? Closest I can come up with is XMOS who have hardware thread scheduling (which cannot be turned off).
I was meaning Multiple memory models that the compiler can choose from, as well as differing options on Stack handling.
Some add choices for parameters in registers or memory or on stack.
So these are the only registers we need: AX, BX, CX, DX, SI, DI, IP, SP, BP, DS, CS, SS, ES. That should be enough that the P2 architecture is useful for the next 30 years or so.
Guys...stop beating on me...I was only joking...
Yikes!
I was starting to feel quite ill then, Phew!
Chip,
I am curious as to whether we could split the cog ram into 2x 256 longs (less the few registers) where in hubexec mode the upper 256 longs could be used instead for larger hub cache memory? This would save the 8x 8xlong cache lines, but instead of 8 lines we would have 32 lines.
If the cog was split into 2 256 longs, might we be able to use the upper half for the aux ram under some circumstances - could save aux ram? I have not totally thought all the implications through. But perhaps if the upper was not usable for cache lines, perhaps it could be used for the clut instead of having aux.
Would any of these changes give us any more hub ram- maybe another 64KB ?
Perhaps only 512 longs, quad port, no aux ram. But 3 possible configuration uses (per cog):
* 512 longs cog ram (less ~14 regs)
* 256 longs cog ram + 256 as 32x8long cache lines (hubexec mode)
* 256 longs cog ram + 256 long clut
In the last case, the 256 long clut could be wide loaded from hub using a special non-stalling instruction. Permit additional slot sharing for this?
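The 32-line figure follows from dividing 256 longs into 8-long lines. As a sketch of how such a region might be indexed, the example below assumes a direct-mapped arrangement (an assumption for illustration only; nothing in the thread says how lines would be selected) and maps a hub long address to a line, a tag, and an offset within the line. All names are hypothetical.

```python
LONGS_PER_LINE = 8       # one cache line holds 8 longs
REGION_LONGS = 256       # upper half of cog RAM repurposed as cache
NUM_LINES = REGION_LONGS // LONGS_PER_LINE   # 32 lines


def locate(long_addr):
    """Map a hub long address to (line index, tag, offset within line).

    Direct-mapped placement is assumed here purely for illustration.
    """
    offset = long_addr % LONGS_PER_LINE
    block = long_addr // LONGS_PER_LINE
    return block % NUM_LINES, block // NUM_LINES, offset


def cog_address(long_addr, region_base=256):
    """Cog RAM address where the long would sit inside the cache region."""
    line, _tag, offset = locate(long_addr)
    return region_base + line * LONGS_PER_LINE + offset
```

Under these assumptions, hub longs 0 and 256 land on the same line with different tags, so a program that jumps between code 256 longs apart would evict its own lines. That is the kind of behaviour a larger cache reduces but does not remove.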
Comments
Am I the only one who totally forgot about this?
No. As long as Chip doesn't forget!
If you can make COGs with more than 512 longs, that's OK.
BUT never shrink it.
Pros:
- might allow a very significant increase in number of icache & dcache lines.
- might allow an increase in number of cogs
Cons:
- reduce possible size of tasks
- cut the potential size of fully deterministic cog code in half
Shrinking the COGs is not a good idea.