Propeller II update - BLOG

potatohead · 2014-01-17 14:18

lol

Am I the only one who totally forgot about this?

mindrobots · 2014-01-17 14:43

potatohead wrote: »

lol

Am I the only one who totally forgot about this?

No. As long as Chip doesn't forget!

Cluso99 · 2014-01-17 15:23

cgracey wrote: »

No. There is a 32-bit XCH (exchange) system that can route 32-bits per clock between/among any/all cogs.

I must have missed this one, or thought it used port D. Another brilliant feature that will not only be faster, but avoids hub slots, and saves port D too.

I am not at all worried about the number of instructions. Many will only be used for special cases. What will make this easy is lots of sectionalised lists of instructions (such as a list of bit manipulation instructions, etc.).

David Betz · 2014-01-17 15:26

Cluso99 wrote: »

I must have missed this one, or thought it used port D. Another brilliant feature that will not only be faster, but avoids hub slots, and saves port D too.

I am not at all worried about the number of instructions. Many will only be used for special cases. What will make this easy is lots of sectionalised lists of instructions (such as a list of bit manipulation instructions, etc.).

I agree that categorized lists will be very useful. I'm tired of manuals these days that just give you an alphabetical list of "things" that include instructions, condition codes, pseudo-ops, and other random things all sorted together. Makes it almost impossible to find something unless you know its name.

cgracey · 2014-01-17 15:29

I've been thinking about how a cog has 512 registers.

I found early on in Prop1 development that 256 registers just wasn't enough room to write applications in, so I doubled the memory to 512 longs and that proved adequate for everything I wanted to do.

Now that we have hub execution, most programs will probably live outside the cog, with only some high-speed code needing to be in the cog registers, while the rest of the registers are variables.

Is there any compelling reason to go back to 256 registers, instead of 512? I'm just wondering. This would shrink the silicon area needed for those space-hungry 4-port RAMs and would free up two instruction bits, with D and S fields falling back to a more natural 8 bits, instead of 9. I don't think I'm going to make such a change, but I thought I'd see if anyone had any interesting ideas about this.
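The bit arithmetic behind that question can be sketched as follows. This is only an illustration: it uses the published Propeller 1 instruction layout (6-bit opcode, 4 ZCRI bits, 4 condition bits, 9-bit D, 9-bit S) as a stand-in, and the P2 encoding under discussion may differ. The names `P1_FIELDS`, `registers_addressable`, and `bits_freed` are made up for this sketch.

```python
# Propeller 1 instruction field widths, used here only as an illustrative stand-in.
P1_FIELDS = {"opcode": 6, "zcri": 4, "cond": 4, "d": 9, "s": 9}


def registers_addressable(field_bits):
    """Number of cog registers a D or S field of this width can name."""
    return 1 << field_bits


def bits_freed(fields, new_width):
    """Instruction bits released if both D and S shrink to new_width bits."""
    return (fields["d"] - new_width) + (fields["s"] - new_width)


if __name__ == "__main__":
    print("instruction width:", sum(P1_FIELDS.values()))
    print("registers with 9-bit fields:", registers_addressable(9))
    print("registers with 8-bit fields:", registers_addressable(8))
    print("bits freed by 8-bit D and S:", bits_freed(P1_FIELDS, 8))
```

Going from 9-bit to 8-bit D and S fields halves the addressable registers and releases one bit per field, which is where the "two instruction bits" come from.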

cgracey · 2014-01-17 15:31

Cluso99 wrote: »

I must have missed this one, or thought it used port D. Another brilliant feature that will not only be faster, but avoids hub slots, and saves port D too.

I am not at all worried about the number of instructions. Many will only be used for special cases. What will make this easy is lots of sectionalised lists of instructions (such as a list of bit manipulation instructions, etc.).

My mistake!!! It DOES use Port D. I'm getting snow blind.

Cluso99 · 2014-01-17 15:32

What might make sense with P2 is comms between cogs where we use hub FIFOs (FullDuplexSerial, etc.): we also set a bit in port D to indicate a byte is available, and the receiving cog does a waitpeq on this bit rather than polling the hub in a tight loop, which frees the hub slot.

Only time will tell. Clearly, neither of these were available with P1.
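A rough model of that flag-plus-FIFO idea, written as a plain Python simulation rather than PASM. Everything here is hypothetical: `SharedPort` stands in for a port D style register visible to all cogs, `DATA_READY` is an arbitrary flag bit, and the deque stands in for a FIFO held in hub RAM. The point is only that the consumer touches the FIFO (the "hub") when the flag says there is something to fetch.

```python
from collections import deque

DATA_READY = 1 << 0  # hypothetical flag bit in the shared port


class SharedPort:
    """Stand-in for a 32-bit port that every cog can read and write."""

    def __init__(self):
        self.bits = 0

    def set(self, mask):
        self.bits |= mask

    def clear(self, mask):
        self.bits &= ~mask & 0xFFFFFFFF


def producer_put(fifo, port, byte):
    """Queue a byte in the (simulated) hub FIFO, then raise the flag."""
    fifo.append(byte)
    port.set(DATA_READY)


def consumer_poll(fifo, port):
    """Return the next byte, or None without touching the FIFO if the flag is low."""
    if not (port.bits & DATA_READY):
        return None  # nothing flagged, so no hub access is spent
    byte = fifo.popleft()
    if not fifo:
        port.clear(DATA_READY)  # FIFO drained, drop the flag
    return byte
```

On real hardware the `consumer_poll` check would be the waitpeq on the port bit; the simulation simply returns `None` instead of blocking.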

Sapieha · 2014-01-17 15:36

Hi Chip.

If you can make COGs with more than 512 longs, that's OK.

BUT never shrink them.

cgracey wrote: »

I've been thinking about how a cog has 512 registers.

I found early on in Prop1 development that 256 registers just wasn't enough room to write applications in, so I doubled the memory to 512 longs and that proved adequate for everything I wanted to do.

Now that we have hub execution, most programs will probably live outside the cog, with only some high-speed code needing to be in the cog registers, while the rest of the registers are variables.

Is there any compelling reason to go back to 256 registers, instead of 512? I'm just wondering. This would shrink the silicon area needed for those space-hungry 4-port RAMs and would free up two instruction bits, with D and S fields falling back to a more natural 8 bits, instead of 9. I don't think I'm going to make such a change, but I thought I'd see if anyone had any interesting ideas about this.

Bill Henning · 2014-01-17 15:40

Interesting idea.

Pros:

- might allow a very significant increase in number of icache & dcache lines.
- might allow an increase in number of cogs

Cons:

- reduce possible size of tasks
- cut the potential size of fully deterministic cog code in half

cgracey wrote: »

I've been thinking about how a cog has 512 registers.

I found early on in Prop1 development that 256 registers just wasn't enough room to write applications in, so I doubled the memory to 512 longs and that proved adequate for everything I wanted to do.

Now that we have hub execution, most programs will probably live outside the cog, with only some high-speed code needing to be in the cog registers, while the rest of the registers are variables.

Is there any compelling reason to go back to 256 registers, instead of 512? I'm just wondering. This would shrink the silicon area needed for those space-hungry 4-port RAMs and would free up two instruction bits, with D and S fields falling back to a more natural 8 bits, instead of 9. I don't think I'm going to make such a change, but I thought I'd see if anyone had any interesting ideas about this.

Sapieha · 2014-01-17 15:52

Hi Chip.

Shrinking the COGs is not a good idea.

BUT is it possible to make the COG space addressable as 2 separate blocks of 256 longs?

And, in HUBEXEC mode, reuse one of the 256-long parts of the COG as the registers you NEED?

Sapieha wrote: »

Hi Chip.

If you can make COGs with more than 512 longs, that's OK.

BUT never shrink them.

ozpropdev · 2014-01-17 16:00

cgracey wrote: »

I've been thinking about how a cog has 512 registers.

I found early on in Prop1 development that 256 registers just wasn't enough room to write applications in, so I doubled the memory to 512 longs and that proved adequate for everything I wanted to do.

Now that we have hub execution, most programs will probably live outside the cog, with only some high-speed code needing to be in the cog registers, while the rest of the registers are variables.

Is there any compelling reason to go back to 256 registers, instead of 512? I'm just wondering. This would shrink the silicon area needed for those space-hungry 4-port RAMs and would free up two instruction bits, with D and S fields falling back to a more natural 8 bits, instead of 9. I don't think I'm going to make such a change, but I thought I'd see if anyone had any interesting ideas about this.

While I look forward to the new world of "HUB exec", I still see a need for 512 long cog ram code. This keeps the P1 concept intact for users
making the transition from P1 to P2. For PURE performance nothing beats good old tight PASM code for speed.

On the subject of the P2 manual/app notes, I put my hand up to volunteer contributing to such a document. When we have FPGA and silicon to play with
I think you will see a flurry of useful code and notes burst onto the forum. I know I have a few things on the go and I know I am not alone here.

"Build it and the docs will come"

Edit: 512 longs is still very useful for multi-tasking apps which in some cases have power saving possibilities.

evanh · 2014-01-17 16:16

cgracey wrote: »

Is there any compelling reason to go back to 256 registers, instead of 512? I'm just wondering. This would shrink the silicon area needed for those space-hungry 4-port RAMs and would free up two instruction bits, with D and S fields falling back to a more natural 8 bits, instead of 9. I don't think I'm going to make such a change, but I thought I'd see if anyone had any interesting ideas about this.

Sounds like an exercise for future in-depth FPGA investigation ... with a Prop3 in mind.

jmg · 2014-01-17 17:12

cgracey wrote: »

I've been thinking about how a cog has 512 registers.

I found early on in Prop1 development that 256 registers just wasn't enough room to write applications in, so I doubled the memory to 512 longs and that proved adequate for everything I wanted to do.

Now that we have hub execution, most programs will probably live outside the cog, with only some high-speed code needing to be in the cog registers, while the rest of the registers are variables.

Is there any compelling reason to go back to 256 registers, instead of 512? I'm just wondering. This would shrink the silicon area needed for those space-hungry 4-port RAMs and would free up two instruction bits, with D and S fields falling back to a more natural 8 bits, instead of 9. I don't think I'm going to make such a change, but I thought I'd see if anyone had any interesting ideas about this.

Interesting.
If there was no Prop1, and no threads, that might make sense.
The presence of P1 rather dictates a minimum of 512, and Threads mean users will tend to pack more into one COG, so the extra is not unused nearly as frequently as it is in P1

A hybrid solution could be to make some of the 512, 2 port memory, thus shrinking the die size but keeping the total 512.
Code fetch and Load/store work in the 2 port memory, but not the full dual-operand opcodes.

- but I'd keep that sort of change for drastic cases like only if the die fails to fit, as it makes users take more care in the memory map, and would make porting P1 code harder.

localroger · 2014-01-17 17:34

cgracey wrote: »

Is there any compelling reason to go back to 256 registers, instead of 512? I'm just wondering.

I can think of several things worth doing without HUBEXEC that would benefit from 512, particularly when both the algorithm and data need room in the cog.

rogloh · 2014-01-17 17:42

cgracey wrote: »

I've been thinking about how a cog has 512 registers.

I found early on in Prop1 development that 256 registers just wasn't enough room to write applications in, so I doubled the memory to 512 longs and that proved adequate for everything I wanted to do.

Now that we have hub execution, most programs will probably live outside the cog, with only some high-speed code needing to be in the cog registers, while the rest of the registers are variables.

Is there any compelling reason to go back to 256 registers, instead of 512? I'm just wondering. This would shrink the silicon area needed for those space-hungry 4-port RAMs and would free up two instruction bits, with D and S fields falling back to a more natural 8 bits, instead of 9. I don't think I'm going to make such a change, but I thought I'd see if anyone had any interesting ideas about this.

Please, if you can manage it, don't shrink the COG register space down to 256. Having more than 256 COG longs is still going to be very useful in cases where you want to simply port over existing P1 COG code that already needs more room to run, or where you need fast byte-addressable lookup tables plus some extra COG space for other data, instructions, or another thread's code, and you can't spare the AUX RAM because you need it for other purposes such as a call stack or video buffer. Having 512 longs on P1 (or really 496) was always handy in 256-entry LUT cases, even though you sometimes needed to be aware of the unusual 9-bit D/S field encodings in the instructions.

cgracey · 2014-01-17 17:57

rogloh wrote: »

Please, if you can manage it, don't shrink the COG register space down to 256.

I don't think we're going to shrink it. I think I'd feel marooned, myself, programming with half the registers we have now.

I was kind of wondering if anyone could come up with something really compelling to do with those two extra instruction bits.

jazzed · 2014-01-17 18:00

cgracey wrote: »

Is there any compelling reason to go back to 256 registers, instead of 512? I'm just wondering. This would shrink the silicon area needed ....

I suspect that routing is a problem now because of % usage.

I like Sapieha's idea of choosing how half of COG memory is used based on HUBEXEC mode and COG mode. That should remove the need for memory we wouldn't otherwise see anyway, will allow COG memory to stay 512 longs, and should allow for a bigger cache in HUBEXEC too (very desirable). That's a win-win-win situation.

Seairth · 2014-01-17 19:15

ctwardell wrote: »

One of the outcomes of slot sharing, if we get it, is likely to be pressure to not poll the hub in a tight loop unless really needed.
We will likely want to come up with some good non-hub methods of signaling between cogs so data only needs to be read from the hub when ready instead of being constantly polled using hub reads.

One approach would be to use PortD for signalling. When one COG sets a particular bit, another COG can efficiently perform a HUBOP.

pedward · 2014-01-17 19:29

Ugh, 256 longs would make a ton of existing code impossible on the P2. Just because *1* thread can run full speed when the stars are aligned, doesn't mean we all want that handicap.

There is just so much code that requires 512 longs to be able to run, period. When you absolutely must have deterministic execution, in-COG is the only solution.

Seairth · 2014-01-17 19:35

cgracey wrote: »

I've been thinking about how a cog has 512 registers.

I found early on in Prop1 development that 256 registers just wasn't enough room to write applications in, so I doubled the memory to 512 longs and that proved adequate for everything I wanted to do.

Now that we have hub execution, most programs will probably live outside the cog, with only some high-speed code needing to be in the cog registers, while the rest of the registers are variables.

Is there any compelling reason to go back to 256 registers, instead of 512? I'm just wondering. This would shrink the silicon area needed for those space-hungry 4-port RAMs and would free up two instruction bits, with D and S fields falling back to a more natural 8 bits, instead of 9. I don't think I'm going to make such a change, but I thought I'd see if anyone had any interesting ideas about this.

LOL! I have a draft blog entry suggesting this very thing! Here are some additional thoughts (from that entry, now that I won't be posting it):

This makes sense because it's likely that the only code that will still be running in cog mode will be self-modifying instructions and small, timing-critical code blocks. And with the ease of switching between modes, code organized this way will be fairly straightforward.
If, for now, you keep 9-bit addressing, you can still use the upper 256 addresses to access special-purpose registers (INDx, DIRx, etc). No more "shadow" registers! In fact, you could even revisit PINx (just kidding!). Or spread PortD to 8 separate 32-bit ports where each cog has one output port and seven input ports wired to the other cogs (not kidding!). Anyhow, you get the idea.

Seairth · 2014-01-17 19:47

Several responses to the 256 vs 512 registers question suggested this would harm tasks. I thought it was going to be possible for all of the tasks to be in hub-mode at the same time. If that's the case, I don't see the problem.

Also, for those commenting about writing deterministic code, how often does the entire cog code need to be deterministic, as opposed to just specific segments/routines? I'm not saying the issue is being exaggerated, I'm just asking if it is.

Seairth · 2014-01-17 20:06

cgracey wrote: »

I don't think we're going to shrink it. I think I'd feel marooned, myself, programming with half the registers we have now.

I was kind of wondering if anyone could come up with something really compelling to do with those two extra instruction bits.

Well, see my other post. But, if you were to go to 8-bit addresses, one thing you could do is replace AUGI with an encoding of up to three additional "operand" registers following an instruction, each of which would be the full 32-bit value instead of 25 bits. These would always be in the first three stages of the pipeline. And with it, you could support all sorts of instructions, which I'm sure this group could easily come up with. Or you could easily inline constants.

And another thought. You could use the bits to enable relative addressing for D and S.

potatohead · 2014-01-17 20:38

I don't see anything compelling enough to warrant shrinking the COG. We've got a lot of features to support big requirements now. Having the COG run like it does in P1 allows for a simple, very high performance programming model.

That is important for those things where the P2 isn't necessarily running from external RAM and where it's performing a lot of different functions. As a use case, mostly COG code represents one extreme of real-time + performance capability, where the other extreme may be more tasks, some threads of compiled code, etc... and even that may well benefit from a roomy COG, depending on how we end up using the register space on big programs.

I would hate to change the COG size before we've seen the other powerful features play out.

Roy Eltham · 2014-01-17 22:11

Chip,

Another need for cog memory is textures! Now with hubexec, we can use pretty much all of it for textures instead of mixed code and textures! Textured rendering is going to look even better!

cgracey · 2014-01-17 22:28

Thanks for your thoughts on 256 registers, Everyone. We'll keep 512, but it was worth investigating a little.

Heater. · 2014-01-17 23:56

Ariba,

...do you feel better now?

Not sure yet...

A big part of the instructions are SET and GET instructions that access hardware-registers

OK, so instead of millions of hardware configuration registers, like a typical MCU, we have millions of different opcodes.

Maybe we could alias them to "mov" in the assembler.

...execution modes add a lot of complexity, mainly hardware tasks (but that's your fault)

Guilty as charged.

jmg,

Some might see 7, I would see that as only 3 - and that is not uncommon on small Micros.
1) COG/HUB
2) Single/Multi threaded
3) Stack in HUB/AUX

I'm curious, What small micros have those three options? Closest I can come up with is XMOS who have hardware thread scheduling (which cannot be turned off).

mindrobots,

The manual will be a BEAST...

We are looking at 200 pages just to describe all the instructions in Propeller manual style. Another hundred or so to explain how it all fits together!

Heater. · 2014-01-18 00:17

Re: 256 registers.

You are right Chip we don't need to waste space on all that COG memory.

Now that we can execute from HUB directly we don't need all those COG locations. We only need a few registers to keep compilers happy, an accumulator register and some others for scratch space. We could call them AX, BX, CX and DX for simplicity. In case we want to work with 16 bit words or bytes we could access half registers, something like AH for high byte AL for low byte and so on.

We will need a register for the stack pointer, call it SP, and to make life easy using local variables in functions we could have a stack frame pointer, call it BP (base pointer)

For string moves and other indexed addressing a couple of index registers would be useful, say SI and DI (src/dst index)

For computed jumps and such it's good to have access to the program counter, call it IP (instruction pointer)

One day we will want more than 256K of HUB so addressing that will be a problem. Better include some memory page or "segment" address registers. Perhaps DS, CS, SS for "data segment", "code segment" and "stack segment". Better throw another segment in just in case we need it: ES for "extra segment".

So we only need 13 COG registers: AX, BX, CX, DX, SI, DI, IP, SP, BP, DS, CS, SS, ES. That should be enough that the P2 architecture is useful for the next 30 years or so.

Guys...stop beating on me...I was only joking...

jmg · 2014-01-18 01:19

Heater. wrote: »

jmg,
I'm curious, What small micros have those three options? Closest I can come up with is XMOS who have hardware thread scheduling (which cannot be turned off).

I was meaning Multiple memory models that the compiler can choose from, as well as differing options on Stack handling.
Some add choices for parameters in registers or memory or on stack.

ozpropdev · 2014-01-18 01:21

Heater. wrote: »

So these are the only registers we need: AX, BX, CX, DX, SI, DI, IP, SP, BP, DS, CS, SS, ES. That should be enough that the P2 architecture is useful for the next 30 years or so.

Guys...stop beating on me...I was only joking...

Yikes!
I was starting to feel quite ill then, Phew!

Cluso99 · 2014-01-18 03:25

Chip,
I am curious as to whether we could split the cog RAM into 2 x 256 longs (less the few registers), where in hubexec mode the upper 256 longs could be used instead for a larger hub cache memory? This would save the existing 8 cache lines of 8 longs each, and instead of 8 lines we would have 32 lines.

If the cog was split into 2 256 longs, might we be able to use the upper half for the aux ram under some circumstances - could save aux ram? I have not totally thought all the implications through. But perhaps if the upper was not usable for cache lines, perhaps it could be used for the clut instead of having aux.

Would any of these changes give us any more hub RAM, maybe another 64KB?

Perhaps only 512 longs, quad port, no aux RAM. But 3 possible configuration uses (per cog):
* 512 longs cog RAM (less ~14 regs)
* 256 longs cog RAM + 256 as 32 x 8-long cache lines (hubexec mode)
* 256 longs cog RAM + 256 long CLUT

In the last case, the 256 long clut could be wide loaded from hub using a special non-stalling instruction. Permit additional slot sharing for this?
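The numbers in the three configurations above can be checked with a few lines of arithmetic. This is a sketch only: the 8-long line size and the 512-long total come from the post, while the dictionary layout and the names `CONFIGS` and `cache_lines` are invented for illustration.

```python
COG_LONGS = 512      # total cog RAM per the proposal
LONGS_PER_LINE = 8   # cache line size quoted in the post


def cache_lines(longs, longs_per_line=LONGS_PER_LINE):
    """How many whole cache lines fit in a block of cog longs."""
    return longs // longs_per_line


# Hypothetical table of the three per-cog uses, in longs.
CONFIGS = {
    "cog only": {"cog": 512, "cache": 0, "clut": 0},
    "hubexec":  {"cog": 256, "cache": 256, "clut": 0},
    "clut":     {"cog": 256, "cache": 0, "clut": 256},
}

if __name__ == "__main__":
    for name, split in CONFIGS.items():
        print(name, split, "cache lines:", cache_lines(split["cache"]))
```

Each split adds up to the same 512 longs, and giving 256 of them to the cache yields 32 lines of 8 longs, four times the 8 lines mentioned earlier in the post.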
