Random thought for Prop II: Separate execution units for hub ops.

Phil Pilgrim (PhiPi) · 2008-05-29 21:24

Several times, while writing code that uses a RDLONG or WRLONG, for example, I've thought it would be nice not to have to wait for the operation to finish when subsequent code didn't depend on the result right away. This got me thinking: why not make the hub interface in the Prop II a separate execution unit with its own pipeline? Whether or not to wait for the result could be determined by the "wc" flag, which isn't currently used for anything in those instructions. For compatibility with current code "wc" would mean "don't wait for completion" (despite the unfortunate "acronym" hinting at the opposite). The carry flag itself could even be used to indicate when the operation finally did complete: clearing initially, and setting consequently. (However, such actions could be an annoyance when nearby ops need the carry for something else.) Another option would be a WAITHUB instruction, which would wait until the hub execution unit is idle. This could be accommodated in the HUBOP family of instructions, extending the required number of decoded source bits from three to four.
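The potential saving can be estimated with a toy model. The sketch below is not a description of real hardware. It assumes each cog gets one hub window every 16 clocks (the "16 cycle wait" mentioned later in this thread) and ignores the cost of the transfer itself. The function names are hypothetical.

```python
HUB_PERIOD = 16  # assumed: clocks between one cog's hub windows


def clocks_until_window(now, cog_slot, period=HUB_PERIOD):
    """Clocks from `now` until this cog's next hub window (0 if it is open now)."""
    return (cog_slot - now) % period


def blocking_total(now, cog_slot, independent_work):
    """The cog stalls until its window, then runs the independent instructions."""
    return clocks_until_window(now, cog_slot) + independent_work


def overlapped_total(now, cog_slot, independent_work):
    """A separate hub unit waits for the window while the cog keeps working.

    A WAITHUB-style sync at the end costs only whatever wait is left over.
    """
    return max(clocks_until_window(now, cog_slot), independent_work)
```

For example, with the window 14 clocks away and 8 clocks of independent work, the blocking version takes 22 clocks in this model and the overlapped version takes 14.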

Extending the idea further still, each cog's hub controller could also include its own execution queue, which could hold several sequential hub instructions at once, along with their arguments. This would allow even more efficient transfer of multiple pieces of data.

One obstacle to overcome would be the "cycle-stealing" required by a hub read operation for writing the result to memory when it became available. Interrupting the main execution unit's pipeline for even one cycle would have a deleterious effect on determinism. OTOH, even that could be planned around during programming in the same way that hub ops are now, in order to hit the "sweet spot".

-Phil

Post Edited (Phil Pilgrim (PhiPi)) : 5/29/2008 9:39:36 PM GMT

rokicki · 2008-05-29 21:51

I think this is absolutely critical to the Propeller II's success. We must have a speculative, out-of-order, restartable pipeline
with branch prediction, a full SIMD set of instructions and registers, at *least* 4GB L0 cache (128-way set associative, please),
and an integrated DDR3 memory controller.

Frankly, I'm just stunned that the Prop 1 did not include all of these features.

[Yes, I'm being facetious and somewhat mean. But even now the timing of the prop is complicated for some people to
understand, even though said deterministic timing is one of the main neat features of the prop. So while I think
Phil's idea above is a neat one, I still like the existing Prop's nice and steady, relatively simple operation.]

Phil Pilgrim (PhiPi) · 2008-05-29 22:21

For single hub write operations, at least, a separate execution unit would simplify timing and make it more deterministic, since there would be no variable wait for the "commutator" to come around. Only read operations would insert cycles into the main execution stream. Of course, my entire comment is predicated on a cog-hub architecture similar to the current one. This may, in fact, be an invalid assumption. For example, hub accesses which are simply faster might trump any need for separate pipelines.

-Phil

cgracey · 2008-05-30 02:06

Phil,

I like the idea of decoupling the COG and HUB operations. The only way I can see this working is to buffer HUB instructions and have them write their 'read' results into dedicated COG registers. This would keep the COG execution pipeline determinate. This kind of fits in with other things that are shaping up, like a register-based CORDIC system for each COG. We may need new WAITxxx instructions to simply hold off until mission-accomplished. This makes the COG more like a kitchen, with a toaster, range, and blender - all of which can be turned on without waiting for the results.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

Phil Pilgrim (PhiPi) · 2008-05-30 02:50

Chip,

The dedicated read register(s) would, indeed, solve the determinacy problem. In such a scenario, the WAITxxx instruction could actually combine a WAIT with a MOV, from the dedicated register to the chosen destination. This would preclude having to allocate any memory address(es) for the read register(s), since their addressing would be implicit in the WAIT-MOV (MOVHUB?) instruction(s). Basically, this just breaks a RDLONG, say, into two instructions: one to begin the operation, the other to get the result into the destination.

Building on this, the implicitly-addressed read register could be organized as a multi-long queue, which would allow an even more efficient transfer of multiple data values. In a RDLONG dst,src wc, for example, the destination address could indicate the number of units to transfer to the queue, beginning from the hub address indicated by the source. Each subsequent MOVHUB instruction would remove one datum from the queue to the designated destination register.

Further still, perhaps a MOVHUB instruction wouldn't necessarily have to block when no data was available. By doing a MOVHUB dst wc, one could indicate (via wc) not to block on an empty queue but, rather, to set the carry flag depending on whether a datum was actually transferred.
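The queued read and the non-blocking MOVHUB can be sketched as a small software model. This is only an illustration of the proposal, with several assumptions: the class and method names are hypothetical, hub memory is a plain list indexed by long rather than by byte address, one long is fetched per hub window, and carry set (True) is taken to mean "a datum was transferred", a polarity the post leaves open.

```python
from collections import deque


class HubReadQueue:
    """Toy model of an implicitly addressed, multi-long hub read queue."""

    def __init__(self, hub_memory):
        self.hub = hub_memory      # list of longs, indexed by long (assumption)
        self.pending = deque()     # hub indices still to be fetched
        self.ready = deque()       # fetched values waiting for MOVHUB

    def rdlong_queued(self, count, hub_index):
        """Models 'RDLONG dst,src wc': request `count` longs from `hub_index` on."""
        for i in range(count):
            self.pending.append(hub_index + i)

    def hub_window(self):
        """One hub access slot: fetch at most one pending long into the queue."""
        if self.pending:
            self.ready.append(self.hub[self.pending.popleft()])

    def movhub_nonblocking(self):
        """Models 'MOVHUB dst wc': returns (carry, value) without blocking."""
        if self.ready:
            return True, self.ready.popleft()
        return False, None
```

A cog program would issue one queued read, carry on with other work, and then drain the queue with MOVHUB, checking the carry result to see whether each datum had arrived yet.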

-Phil

Cluso99 · 2008-05-30 08:43

My 2 cents worth...

1. Let the hub run the round-robin architecture, but if the next cog doesn't have any data waiting (readlong/writelong) then don't waste the cycle, but give it to the next waiting cog. In that way the 16 cycle wait could be reduced, and hence latency, if no other cog wanted their timeslice. Hopefully the setup time for the hub read/write instructions can also be reduced in the Prop II.

2. Add readlong/writelong with a count that would be incremented in both hub and cog. Not sure where the count would be stored (a register?) so that the hub could decrement it.

I would like to be able to copy a block of data to/from the cog memory quickly. If this can be done, it would also be possible to quickly copy assembler routines (or data blocks) to/from the cog.

I am sure both of these ideas would be simpler to implement in silicon than some of the other proposals.
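The effect of idea 1 can be illustrated with a toy arbitration model. It assumes one hub slot per cog turn, that every request is served in a single slot, and that the set of waiting cogs does not change while counting. The function names are hypothetical.

```python
def fixed_slot_wait(current_slot, cog, num_cogs):
    """Slots until `cog` is served when every cog always gets its turn."""
    return (cog - current_slot) % num_cogs


def skip_idle_wait(current_slot, cog, waiting, num_cogs):
    """Slots until `cog` is served when idle cogs' turns are skipped.

    `waiting` is the set of cogs with a hub request pending; it must include
    `cog`. Only cogs that are actually waiting consume a slot.
    """
    wait = 0
    slot = current_slot
    while slot != cog:
        if slot in waiting:
            wait += 1
        slot = (slot + 1) % num_cogs
    return wait
```

With 8 cogs, the hub pointing at cog 1, and cog 0 requesting, the fixed scheme waits 7 slots. Skipping idle cogs waits 0 slots if no other cog is waiting and 2 slots if cogs 3 and 5 are, so the wait is shorter but depends on what the other cogs are doing.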

Ken Peterson · 2008-05-30 20:36

Cluso99: I believe one of the nice things about the propeller architecture is the fact that it is very deterministic so that timing can be precisely controlled by counting instructions. If you have variable latency with hub accesses, this might prove to be more difficult.

Cluso99 · 2008-05-31 02:32

Ken: I consider the hub to be the bottleneck in a fantastic design. Given the restrictions of 496 longs per cog, moving data in/out of the cog is a major concern. It cannot be accomplished fast enough and any loss in deterministic behaviour here would be far outweighed by the performance boost. To get over the cog ram limitations it is necessary to move code and/or data in and out of the cog efficiently. It can be via a new instruction or whatever, but don't waste the unused hub cycles.

As I said, my 2 cents worth
