
Getting more out of the P2...


Comments

  • jmg Posts: 15,140
    Electrodude wrote: »
    ...
    It just struck me the other day that the fastest way to do short conditionals, like "if (cond) { ... } else { ... }" and "cond ? a : b" with bodies shorter than 16 or so instructions would be to test the condition, get the result into z or c, and then if_z or if_nz all of the following instructions.

    Yes, that is the Skip approach, and it is well suited to any code-feeder that does not like a change of direction, so that also includes Serial memory and XIP designs.
    (and even some MCUs with pipelines ... )

    Hopefully, Code Generators will use this approach more and more.
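    For example, the test-then-prefix idea boils down to something like this (a rough, untested P1-style PASM sketch; the register names are made up):

                cmp     value, target  wz   ' Z := (value == target)
        if_z    mov     result, #1          ' "then" body runs only while Z is set
        if_z    add     total, result
        if_nz   mov     result, #2          ' "else" body runs only while Z is clear
        if_nz   sub     total, result

    None of the body instructions write the flags, so Z stays valid for the whole run of prefixed instructions.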

  • evanh wrote: »
    HubExec branching will always cause a FIFO reload, I believe: a branch effectively issues a RDFAST to redirect the FIFO. The FIFO pre-fetches a rotation ahead and has no internal addressability. The Cog must accept everything the FIFO feeds it consecutively, or issue a reload.

    But jumps can still be aligned to minimize timing, so that the hub will be at the best possible place at the instant the virtual RDFAST is issued.

    Are you saying the hub will always go around once after a RDFAST, or that it takes a while for data to filter through the FIFO after it's read before it's accessible to the cog, or something like that?
  • Code and data alignment are going to be very tricky with the P2. The alignment will depend on which cog is executing the code. Specialized bits of tweaked code can be written to take advantage of this, but the eggbeater architecture will generate a lot of hub stalls for random access.
  • jmg Posts: 15,140
    But jumps can still be aligned to minimize timing, so that the hub will be at the best possible place at the instant the virtual RDFAST is issued.
    Maybe ideally, but that sort of alignment quickly gets more complex as you have more jumps.
    Are you saying the hub will always go around once after a RDFAST, or that it takes a while for data to filter through the FIFO after it's read before it's accessible to the cog, or something like that?

    Anything that is not naturally random access is going to have fish-hooks.

    You can mitigate those; for example, in theory a short forward jump could just move up the FIFO (I don't think HUBEXEC actually does this now), but even with that level of smarts, at some stage bandwidth limitations may force a stall, which could be one of those rare-but-nasty bugs.

    All of this is why the Skip approach you already mentioned is a good way to avoid 'stutter gotchas', and the hardware is simpler too...
  • Dave Hein wrote: »
    Code and data alignment are going to be very tricky with the P2. The alignment will depend on which cog is executing the code. Specialized bits of tweaked code can be written to take advantage of this, but the eggbeater architecture will generate a lot of hub stalls for random access.

    I do not think the cog id matters. It is the memory addresses which are important.
  • jmg Posts: 15,140
    Dave Hein wrote: »
    Code and data alignment are going to be very tricky with the P2. The alignment will depend on which cog is executing the code. Specialized bits of tweaked code can be written to take advantage of this, but the eggbeater architecture will generate a lot of hub stalls for random access.

    True, but surely the tools will allow designers the control to avoid getting stuck in those areas ?
    e.g. I would expect segment support so user code can target either COG or HUB exec, and time-critical code would be collected and placed in COG.

    Of course, support for off-chip segments could be nice too, for rarely-used code to be left in XIP memory
  • 78rpm wrote: »
    Dave Hein wrote: »
    Code and data alignment are going to be very tricky with the P2. The alignment will depend on which cog is executing the code. Specialized bits of tweaked code can be written to take advantage of this, but the eggbeater architecture will generate a lot of hub stalls for random access.

    I do not think the cog id matters. It is the memory addresses which are important.
    With the eggbeater architecture the cog id does matter. The time slot for accessing a particular hub address is a function of the cog id and address bits 2 through 5.
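    (As I understand it: byte-address bits 2..5 pick which of the 16 hub RAM slices a long lives in, the rotating hub visits one slice per clock, and a given cog's turn at a given slice comes around once every 16 clocks, so the wait for a particular address depends on where that slice sits relative to the cog's place in the rotation.)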

  • Dave Hein wrote: »
    78rpm wrote: »
    Dave Hein wrote: »
    Code and data alignment are going to be very tricky with the P2. The alignment will depend on which cog is executing the code. Specialized bits of tweaked code can be written to take advantage of this, but the eggbeater architecture will generate a lot of hub stalls for random access.

    I do not think the cog id matters. It is the memory addresses which are important.
    With the eggbeater architecture the cog id does matter. The time slot for accessing a particular hub address is a function of the cog id and address bits 2 through 5.
    You are correct.
  • Electrodude Posts: 1,614
    edited 2016-05-04 02:10
    Dave Hein wrote: »
    Code and data alignment are going to be very tricky with the P2. The alignment will depend on which cog is executing the code. Specialized bits of tweaked code can be written to take advantage of this, but the eggbeater architecture will generate a lot of hub stalls for random access.
    Dave Hein wrote: »
    78rpm wrote: »
    Dave Hein wrote: »
    Code and data alignment are going to be very tricky with the P2. The alignment will depend on which cog is executing the code. Specialized bits of tweaked code can be written to take advantage of this, but the eggbeater architecture will generate a lot of hub stalls for random access.

    I do not think the cog id matters. It is the memory addresses which are important.
    With the eggbeater architecture the cog id does matter. The time slot for accessing a particular hub address is a function of the cog id and address bits 2 through 5.

    Once you're lined up (via "rdlong dummy, #0" or "rdlong parameter, ##pointer_which_mod_16_equals_known_value" or something), you don't have to worry about it, unless you need multiple cogs outputting stuff perfectly in unison.

    It will be almost the same as on the P1. As long as you follow a few rules, you can easily cycle-count your entire program, hubops and all, and rule out all non-determinism. On the P2, the only thing that will make it harder is that you'll have to watch out for the alignment of things.
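    Something like this (untested, just to illustrate; the register names are made up):

                rdlong  dummy, #0           ' throwaway hub read - from here on, hub
                                            ' windows arrive at a known phase for this cog
                mov     outa, pattern       ' 1 instruction time
                xor     pattern, mask       ' 1 instruction time
                wrlong  pattern, bufptr     ' this hub op now lands at a predictable
                                            ' offset, so its cost can be counted exactly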
  • Heater. Posts: 21,230
    Electrodude,
    It will be almost the same as on the P1. As long as you follow a few rules, you can easily cycle-count your entire program, hubops and all, and rule out all non-determinism. On the P2, the only thing that will make it harder is that you'll have to watch out for the alignment of things.
    Once again I am reminded of The Story of Mel:

    http://www.pbm.com/~lindahl/mel.html
  • Electrodude Posts: 1,614
    edited 2016-05-04 05:22
    Heater. wrote: »
    Electrodude,
    It will be almost the same as on the P1. As long as you follow a few rules, you can easily cycle-count your entire program, hubops and all, and rule out all non-determinism. On the P2, the only thing that will make it harder is that you'll have to watch out for the alignment of things.
    Once again I am reminded of The Story of Mel:

    http://www.pbm.com/~lindahl/mel.html

    Cycle counting isn't that hard. And self-modifying code isn't hard either if it comes with comments and wasn't originally written in hexadecimal, like in The Story of Mel.

    In fact, on the P1 at least, you don't even normally need to count cycles - you can get away with counting instructions, where 1 instruction = 4 cycles. All normal instructions take 4 cycles, jumps sometimes take 8 cycles, and hub windows come every 16 cycles, with hub ops taking 7..22 cycles (but 20, 16, 12, or 8 in practice, since you don't normally have waitcnts for weird amounts of time in critical sections). This means you can assume all normal instructions take 1 instruction time, jumps take 1 or 2 instruction times, and hub ops take 2-5 instruction times. The amount of time, in instructions, that a hub op takes is ((instructions_between_hub_ops + 2) % 4 + 2), where instructions_between_hub_ops is in 4-cycle periods (i.e. normal instruction times). This is all assuming that you don't have any WAITxxx's anywhere - if you do, and don't know the duration of the wait mod 16, then the first hub op after the wait instruction takes a pretty much indeterminate amount of time.
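    For example, by that formula: two normal instructions between hub ops means the next hub op costs (2 + 2) % 4 + 2 = 2 instruction times (8 cycles), three means (3 + 2) % 4 + 2 = 3 instruction times (12 cycles), and six wraps back around to (6 + 2) % 4 + 2 = 2 instruction times (8 cycles).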

    A fancy IDE (like the one I'll write for my compiler that I'll probably never finish) could do cycle counting for you.
  • Heater. Posts: 21,230
    Electrodude,

    I've been around a bit. Done my fair share of cycle counting, self-modifying code and manually compiling to hexadecimal. Perhaps not all at the same time though!

    One of the great attractions of programming in PASM on the Prop is the regularity of the instruction timing.
  • evanh Posts: 15,126
    Are you saying the hub will always go around once after a RDFAST, or that it takes a while for data to filter through the FIFO after it's read before it's accessible to the cog, or something like that?
    I could be wrong and the only delay is until the first long has been fetched, but I think RDFAST doesn't return until the FIFO has retrieved 16 longs. Chip uses the term "has new data".
  • evanh Posts: 15,126
    It would be good to have some way to pragmatically avoid the RDFAST instruction stalling when writing code for a soft device. Then one could use it to direct RFxxxx pre-fetches for a certain number of instructions later without having to run through blocks at a time.
  • jmg Posts: 15,140
    evanh wrote: »
    It would be good to have some way to pragmatically avoid the RDFAST instruction stalling when writing code for a soft device. Then one could use it to direct RFxxxx pre-fetches for a certain number of instructions later without having to run through blocks at a time.

    Optimizing compilers can do some pretty clever things.

    I've been running simple mathop tests on Intel's D2000, and with debug off there is some evident cleverness in action.
    Sadly, all that cleverness also fumbles and does inconsistent and strange things too... looks a bit like 'alpha code' ?



  • Returning to the topic, I've been watching the "Shared LUT" conversation (to put it mildly) and realized that a lot of the conversations like these are based on arguing extremes: maximum hub throughput, maximum inter-cog communication, minimum latency, etc. And the ability to handle those extremes is important, but I wonder if it's what's most important.

    I suspect that the majority of the time, the P2 will be used for comparatively mundane and less-than-extreme needs. In other words, cogs will be underutilized, many/most of the smart pins will be idle, hub access will still be fast enough, etc. Yes, I'm sure that every one of us can easily think of an exceptional case, but I suspect most of those cases are limited to a subset of the chip's capabilities (e.g. "a cog that's generating video will be fully utilized" or "the fft calculations will consume three cogs running at full tilt" or "this parallel bus requires 17 smart pins"). But, for each of those examples, there are practically as many (or more) examples where the rest of the chip is mostly idling.

    So, the question then becomes "does the current design work best for the mundane and less-than-extreme needs?" No, it's not as exciting as "maximize this" and "minimize that", but I think it is what we are really looking for most of the time.
  • I've been pondering whether we really need 16 cogs or not. Is this another example of going for extremes instead of designing for the mundane and not-so-extreme? Yes, more cogs open more opportunities, but it's always with a trade-off. More of one thing means less of another thing: RAM, bandwidth, etc.

    With the addition of smart pins, I suspect that many/most cogs will find themselves waiting. Of course, when a smart cell signals, we want a cog to respond quickly. And with 16 cogs at our disposal, we can just let them single-mindedly wait for those events.

    But what if we could instead have a single cog service multiple disparate smart pin requirements (e.g. sync serial, I2C, and ADC capture)? Obviously, this won't always be possible, at least not beyond two or so unrelated I/O tasks. But, with so much work offloaded to the smart pins, this is not an unreasonable expectation or possibility.

    And, given that, could we now reasonably get away with 8 cogs again? With that, would you now have the space to double the hub RAM (and reduce the egg-beater latency to 8 clocks)? Maybe widen the smart pin data paths to make transfers faster (and therefore free up more cycles for supporting other smart pins on the same cog)? And would these changes do a better job of supporting the mundane and less-than-extreme uses? Would it still be enough for those specific extreme uses? And does the rest of the hardware, as it currently stands, support an effective and straight-forward way to efficiently use as few cogs as possible?
  • Heater. Posts: 21,230
    Seairth,
    I've been pondering whether we really need 16 cogs or not.
    Please don't do that. The question over the number of COGs is a debate that has been recurring here for ten years already.

    Bottom line, for me at least, is that having multiple COGs is not about achieving maximal performance. I would not care at all if I found COGs sitting idle 99% of the time. Rather it's about programming simplicity.

    In the normal MCU we traditionally had a single processor. In order to handle multiple asynchronous events, interrupts were invented. This works fine except it introduces programming complexity. You have to worry about the time budget of all those events happening at the same time. You have to worry about priority levels in order to ensure an event requiring a low-latency response is not handled late. You have to hook code into an interrupt system. It is all difficult to reason about, as everything has an impact on everything else. It makes importing code you have not written yourself tougher.

    Having multiple COGs is a way to do what interrupts do. But because we have a processor dedicated to each event, each event can be handled in isolation. That separation of concerns makes programs much easier to reason about. It makes mixing and matching drivers and other code from places like OBEX very easy. Just throw it into your program and you can be pretty sure it will work, and importantly you can be sure it won't mess up the timing of the code you already have, so that will continue to work as if nothing happened.

    We had 8 COGs on the P1 and it was great. 16 is even better :)
    But what if we could instead have a single cog service multiple disparate smart pin requirements
    This is exactly the interrupt scheme that we all wanted to get away from as described above.



  • IIRC Chip mentioned that increasing the HUB size with this process would have a large effect on yield too.
  • jmg Posts: 15,140
    Seairth wrote: »
    And, given that, could we now reasonably get away with 8 cogs again? With that, would you now have the space to double the hub RAM (and reduce the egg-beater latency to 8 clocks)? Maybe widen the smart pin data paths to make transfers faster (and therefore free up more cycles for supporting other smart pins on the same cog)? And would these changes do a better job of supporting the mundane and less-than-extreme uses? Would it still be enough for those specific extreme uses? And does the rest of the hardware, as it currently stands, support an effective and straight-forward way to efficiently use as few cogs as possible?

    Interesting questions, and certainly the Pin_Cell has changed the scope of what any one COG can do.
    It is a pity there is no 12 COG choice ...

    Maybe when the next release is out, with the DAC-Data links and Fast Signaling, some use cases can be explored that try to fill up all COGs ?

    Another way to phrase this is: what is the Memory Equivalent of 8 removed COGs ?


  • evanh wrote: »
    You're a hard case Leon. Maybe some of those accusations are deserved.
    Leon wrote: »
    I'm just pointing out that XMOS has been successful with similar chips to the P2 and Parallax can learn from them. What is wrong with that?


    I know Leon gets a lot of grief from some people for mentioning Xmos but it's worth sticking 'Xmos' into eBay's search box and seeing what comes up. It's quite enlightening to see what you can get for not much money.
  • Heater. wrote: »
    Seairth,
    I've been pondering whether we really need 16 cogs or not.
    Please don't do that. The question over the number of COGs is a debate that has been recurring here for ten years already.

    No worries. I'm pretty sure 16 cogs are here to stay. And I do realize that we've debated the 8-vs-16 cog variant many times. But I don't recall anyone asking the question once smart pins were added. Originally, we had 16 cogs precisely because the pins weren't smart (i.e. the cogs had to do all of the work). And with 64 pins, 8 cogs clearly weren't going to be enough. But is that still the case now?
    Heater. wrote: »
    Bottom line, for me at least, is that having multiple COGs is not about achieving maximal performance. I would not care at all if I found COGs sitting idle 99% of the time. Rather it's about programming simplicity.

    In the normal MCU we traditionally had a single processor. In order to handle multiple asynchronous events, interrupts were invented. This works fine except it introduces programming complexity. You have to worry about the time budget of all those events happening at the same time. You have to worry about priority levels in order to ensure an event requiring a low-latency response is not handled late. You have to hook code into an interrupt system. It is all difficult to reason about, as everything has an impact on everything else. It makes importing code you have not written yourself tougher.

    Having multiple COGs is a way to do what interrupts do. But because we have a processor dedicated to each event, each event can be handled in isolation. That separation of concerns makes programs much easier to reason about. It makes mixing and matching drivers and other code from places like OBEX very easy. Just throw it into your program and you can be pretty sure it will work, and importantly you can be sure it won't mess up the timing of the code you already have, so that will continue to work as if nothing happened.

    We had 8 COGs on the P1 and it was great. 16 is even better :)
    But what if we could instead have a single cog service multiple disparate smart pin requirements
    This is exactly the interrupt scheme that we all wanted to get away from as described above.

    You may be right about all of that. I agree that combining functions on a single cog has its challenges. On the other hand, we clearly pay a price for the convenience of not having to do that. It's all about balance, and my concern was whether the balance for the current design is correct. For instance, would 1MB of hub RAM obviate the need, in most cases, to add external RAM? Does that outweigh having enough cogs to allow all code to be "single threaded"? Would reducing the egg-beater cycle to 8 clocks obviate the need, in most cases, for adding shared LUT capabilities? Would an 8-bit or 16-bit data bus to the smart pins free up enough clock cycles, in most cases, to offset the reduced number of cogs? What is the right balance?

    Yes, I know this is a moving target. And I know that every one of us will have a slightly different opinion on what is the "right" balance. But, as I said in an earlier post, it occurred to me that we often start arguing for features as they would be used in extreme cases. With that sort of focus, I suspect we somewhat lose sight of maintaining balance in the overall design. Once a feature that's designed for the extreme has been added, it's hard to go back and remove it (because you would "lose an extreme ability", which sounds dire). That shifts the balance. Ideally, after each of these additions, we should have revisited the overall design and asked whether other features need to be adjusted to maintain a good balance. In hindsight, I don't think we've done that nearly enough...

    Of course, now we need to get something out the door. So the design is what it is. I don't expect any big changes, so this discussion is mostly academic. We will make the P2 work, as it is, one way or another. Just like we did with the P1.
  • Cluso99 Posts: 18,066
    I have often wondered if a dual 8 cog prop would work, with a small dual port hub section between them.
    An alternate design of 8 sets of dual cog pairs. Only one cog of the pair had hub access, with the second cog sharing dual port LUT. This would make nice cooperating cogs.

    Also, I wonder if there were only 32 identical smart pins, and all had access to the full 64 I/O pins with the same "OR" function as the cogs do with the pins now. This would likely reduce die space and make the smart pins more flexible.
  • Heater. Posts: 21,230
    edited 2016-05-15 13:51
    Clusso,
    I have often wondered if a dual 8 cog prop would work, with a small dual port hub section between them.
    That idea certainly has its attractions. I seem to recall such things were suggested back in the mists of time. If not, they should have been. It would halve HUB bandwidth stress.

    Looking at it now, it has the downside of halving the RAM available to a COG, thus scuppering any hope of running something like a JavaScript interpreter there.
  • Cluso99 Posts: 18,066
    Heater,
    Perhaps I should have also suggested that one only have a small hub ram. But then each would not be equal :(
  • Heater. Posts: 21,230
    Ha yes,

    Then we start approaching the Parallella Epiphany chip design. A matrix of processors connected north, south, east, west by a bus in a toroidal topology. 32K RAM at each node.

    Fast as hell. If you can figure out how to program it :)

  • Cluso99 Posts: 18,066
    My thought about dual 8 cog props was that one would handle mostly just I/O, especially if there were no smart pins. The other would handle higher level stuff. Things such as video ram (for games or whatever - because of the larger hub ram). Also the high level FAT32. Perhaps a basic OS.
  • Cluso99 Posts: 18,066
    Would be interesting with shared LUT and 3 cogs, to see how fast we could get ZiCog running. Sort of a pipelined Z80 emulator.
  • jmg Posts: 15,140
    Cluso99 wrote: »
    Also, I wonder if there were only 32 identical smart pins, and all had access to the full 64 I/O pins with the same "OR" function as the cogs do with the pins now. This would likely reduce die space and make the smart pins more flexible.

    I don't see how removing 32 pins makes those left 'more flexible', but yes, lowering the Smart Pin count was/is always a late-choice option, if die area was a killer.

    Chip has changed the Smart-Pins somewhat, so that separate pin-cells manage Tx and Rx, which shrinks the Pin-cell size, but also makes halving the total count more of an issue.
    There is also no binary lock on this; you could do 48 Pin cells too, should that be required.

  • jmg Posts: 15,140
    edited 2016-05-15 20:29
    Cluso99 wrote: »
    I have often wondered if a dual 8 cog prop would work, with a small dual port hub section between them.
    An alternate design of 8 sets of dual cog pairs. Only one cog of the pair had hub access, with the second cog sharing dual port LUT. This would make nice cooperating cogs.

    hmm, Only one cog of the pair had hub access

    Two weeks ago, the obvious issue there was the relative isolation of the second COG. Too crippling.

    -however-

    the new DAC-Data pathway has rather turned all that on its head.

    We do really need hard numbers on all these pathways now.

    Clearly, 8 hCOGS doubles the hub-slot rate, and halves the jitter.

    Given the MUX and routing Chip was thinking of throwing at ANY-LUT access, it may be possible to do this, as a boot-time choice, with less Logic ?

    Choose, at reset, any of :

    16 hCOGS -> All COGS are equal, tH = SysCLK/16 (present config)
    8 hCOGS + 8 mCOGS -> tH = SysCLK/8, Minion COGS can talk to any other COG via DAC_Data
    4 hCOGS + 12 mCOGS -> tH = SysCLK/4, Minion COGS can talk to any other COG via DAC_Data


