Hub timing using a power-of-two scheme

Seairth · 2014-05-05 19:58

(preamble part 1: If you think you already know what this is going to say, please do me the courtesy of still reading it in its entirety before commenting. This concept has, at least in my mind, gotten entangled with jmg's TopUsedCog concept, and this thread is partly an effort to untangle it. At the same time, based on the various questions asked, I feel I did a poor job of explaining my idea the first time around. So, this is also an attempt to explain it more clearly. Also, if discussion does happen, it won't clog up the main P1+ thread further.)

(preamble part 2: This is not an attempt to promote this scheme above all other ideas. I already know that it meets a certain amount of resistance and/or outright disagreement from some people. All I'm trying to do is document it separately from the others to try avoiding some of the confusion occuring in the P1+ thread. So, if you intend to argue against the idea and have already done so in the P1+ thread, please refrain from it here. In fact, if you have no interest in this idea at all, please feel free to ignore it altogether. You won't hurt my feelings.

)

(preamble part 3: My apologies for it's length. This is consolidating several separate posts into a single post, as well as being filled in with some additional details/thoughts. Also, you will find that I am repeating some bits over and over again, though hopefully in a variety of contexts that help give a clearer overall picture.)

Basic Concept

The power-of-two concept is simple: the hub access window is a round-robin scheme where the number of slots is 2^n, based on which cogs are currently active. The following table shows the combinations.

COG #  | 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Active |                         x  x  x  x  x  x  x  x
       |             x  x  x  x
       |       x  x
       |    x
       | x
n      | 0  1  2  2  3  3  3  3  4  4  4  4  4  4  4  4
2^n    | 1  2  4  4  8  8  8  8 16 16 16 16 16 16 16 16

In other words, if the highest active cog number is between 9 and 16, the hub will use 16 slots (one for each of the 16 cogs). If the highest active cog number is between 5 and 8, the hub will instead use 8 slots (one for each cog from 1 to 8). If the highest active cog number is between 3 and 4, the hub will instead use 4 slots (one for each cog from 1 to 4). If the highest active cog number is 2, the hub will instead use 2 slots (one for each cog from 1 to 2). And finally, if the highest active cog number is 1, the hub will instead use 1 slot (for cog 1 only).

Assuming that the hub uses SYSCLK as a counter to determine which slot is active, it would currently be masking the lowest 4 bits (i.e. current_cog = SYSCLK & %%1111) to determine which cog has access to the hub. With the above scheme, this mask would be still be %%1111 for 16 slots, but would be %%0111 for 8 slots, %%0011 for 4 slots, %%0001 for 2 slots, and %%0000 for 1 slot. Put yet another way, the mask would be a 4-bit value with the "n" LSBs set to 1. And here is some verilog that shows one way to generate the mask:

// active_cogs is a 16-bit vector representing the active state of each of the cogs
function [3:0] GetHubMask(input [15:0] active_cogs);
	return {|active_cogs[15:8], |active_cogs[15:4], |active_cogs[15:2], |active_cogs[15:1]};
endfunction

So, to implement this, the following circuits would have to be changed:

Add a global 4-bit mask register that is ANDed with SYSCLK to determine which cog gains hub access. (the masking operation may already exist.)
Add a block to the global COGINIT and COGSTOP code which sets the mask register, using code like the function above.

What you get

Now, what do you get for this scheme? You get a hub access scheme that will ALWAYS give every cog access every 16 clock cycles (just as the P1+ currently does), but may also give up to the first 8 cogs access every 8 clock cycles, give up to the first 4 cogs access every 4 clock cycles, give up to the first 2 cogs access every 2 clock cycles, and give up to the first cog access every clock cycle.

Putting this another way, with this scheme, if you write a driver that is timed to perform a HUBOP every 16 clock cycles (which, again, is exactly the limitation the current P1+ design imposes), it will ALWAYS work as designed. Regardless of the number of cogs that are running. Regardless of which cogs are running. This is because every 2^n hub cycle combination will always result in every cog getting access to the hub every 16 clock cycles. If there are fewer cogs, it may get access to the hub more frequently, but it will always be guaranteed to have access every 16 clock cycles.

Now, suppose you are using only 8 cogs (or even as little as 5 cogs), and you would like to have at least one of them get more frequent access to the hub. No problem. Just make sure that you are running within the first 8 cogs, and they now all have access every 8 clock cycles. If you happen to use a driver that's timed for 16 clock cycles, it will continue to work without issue, because that cog is still getting access every 16 clock cycles (it will actually be getting it every 8 cycles, but the code is still tuned for every 16 cycles).

What's the most obvious use of this mode? Porting P1 applications! Ponder this: on the P1, you have access to the hub every 16 clocks, hubops take 8 cycles to complete, and most other instructions take 4 cycles to complete. This means that you can optimally place two regular (4-cycle) instructions between each hubop. Now, suppose you ported this to the current version of the P1+. Let's also assume that the hubop timing mirrors that of the P2 (4 clock for writes, 6 clocks for reads, ignoring the pipeline). You are now dealing with hubops of 4-6 clock cycles (1-2 instruction cycles), meaning you will need 6-7 instructions between hubops to be optimal. Obviously, this is going to require a great deal of refactoring! But, if you are cycling the hub every 8 clocks with the above scheme (which would be the case if you were running the same number of cogs as on the P1), you now only need 2-3 instructions between hubops to maintain optimal timing! So, for code following RDxxx operations, you wouldn't need to change your code at all, and code following WRxxx operations could be quickly corrected by adding a single NOP (or other minor tuning).

But, of course, it may also simply be that someone wants to have a few cogs have more frequent access to the hub than others. Or maybe the developer wants to run the chip at 100MHz and still have effectively the same hub access speed as a 16-cycle hub running at 200MHz. Or whatever. Simply put, if you can fit your project in the first 8 cogs, those cogs can take advantage of the double hub access speed. If a developer can't meet that limitation, then they are no worse of than they would be with the current P1+ design (i.e. 16-cycle hub timing). And if the developer has a really small project that can run in only 4 cogs? Then they can get 4x access to the hub (which would allow transfer rates using RDLONG/WRLONG to be comperable to RDQUAD/WRQUAD on the current 16 cycle hub timing).

Realistically, the most common timing scenarios are going to be 16 clock cycles and 8 clock cycles. Unless multitasking returns. In which I think you'd see the 4 clock cycle timing being regularly used as well. Speaking of multitasking, this scheme would work well in a multitasking cogs. The reason is simple: grouping multiple tasks on a cog reduces the number of cogs used, which potentially increases the hub access rate, which in turn reduces the stalls that the individual tasks would encounter during hubops. And, all the while, none of this adversely affects the single-tasked cog whose code expects 16 clock cycle timing (just like the current P1+).

Now, suppose you are starting a new project, and you want to make sure the hub timing does not change throughout. Simply make sure at least one of the used cogs is in the range of cog 9 to 16 (e.g. start your "main" code in cog 16). This forces all cogs, regardless of which and/or how many are running, to be limited to the 16 clock hub timing (just like the current P1+). Then, as you progress in the project, maybe you decide that you need increased hub access. If you are using 8 or fewer cogs and can run your code using the first 8 cogs, then you can get a 2x increase in hub accesses. And if any of the cogs were still expecting the 16 clock hub timing, they will still work exactly the same.

OBEX

This is fairly straight forward: if the driver is designed with 16 cycle hub timing (just like one would right now with the current P1+), it can safely be used in any project. If, however, a developer wants to write a driver that can take advantage of faster hub accesses, it's just a matter of making a statement like:

HUB: works from 4 to 16 cycles, optimized for 8 cycles

Or, if the developer wants to be more verbose:

HUB: XXXX max baud at 16 cycles, YYYY max baud at 8 cycles, ZZZZ max baud at 4 cycles

You get the idea. Yes, a developer could always omit this information or provide incorrect information. But that's nothing special to this scheme; it can happen with the existing OBEX just as easily. As for Parallax-developed drivers, as well as those developed by the seasoned contributors that we all trust, I have no doubt they can handle this.

How does this compare to other approaches?

No, this scheme isn't as flexible as some of the other approaches. And, considering the dependence on which cogs are active to determine the timing, it might not be usable as often as some of the other approaches. For instance, if it is still the case with P1+ that DACs are associated to specific cogs, then a physical design requirement might require a DAC on one of pins 33-64 to be used, and therefore cog 9-16. This scheme would not allow such designs to take advantage of faster hub accesses (which is exactly the same limitation as the current P1+ design, by the way).

On the other hand, this scheme is very simple to implement (I think). And this scheme is fully backward-compatible, in that it is guaranteed to not break determinism for any code that accesses the hub every 16 clocks (exactly like the current P1+). And, in many cases, projects will use 8 or fewer cogs, and therefore get more frequent access to the hub (if they want it) in a very simple to understand way. No configuration. No special instructions. No need to start cogs in a specific manner.

Summary?

Well, I guess I don't really have a summary. At the end of the day, if this idea doesn't go anywhere, that's fine. It's mainly food for thought. And maybe something that Chip finds worth implementing. Or not. Either way, I hope everyone who made it this far has a better (and, hopefully, full) understanding of the power-of-two scheme.

So, I'll just finish by reminding you of the bits I said in the preamble. If you do comment, try to make it constructive and on-topic. Thanks.

Seairth · 2014-05-05 19:59

Note that the mask hardware could be simplified slightly:

// active_cogs is a 16-bit vector representing the active state of each of the cogs
function [3:0] GetHubMask(input [15:0] active_cogs);
	return {|active_cogs[15:8], |active_cogs[15:4], active_cogs[15:2], 1'b1};
endfunction

Which would give timing like:

COG #  | 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Active |                         x  x  x  x  x  x  x  x
       |             x  x  x  x
       |       x  x
       | x  x
n      | 1  1  2  2  3  3  3  3  4  4  4  4  4  4  4  4
2^n    | 2  2  4  4  8  8  8  8 16 16 16 16 16 16 16 16

This is predicated on the fact that, at best, hubops can't happen any more frequently than every 2 clock cycles (the effective instruction speed), so there's no point in cycling the hub window any more frequently than that either.

jmg · 2014-05-05 20:34

Seairth wrote: »

Now, what do you get for this scheme? You get a hub access scheme that will ALWAYS give every cog access every 16 clock cycles (just as the P1+ currently does), but may also give up to the first 8 cogs access every 8 clock cycles, give up to the first 4 cogs access every 4 clock cycles, give up to the first 2 cogs access every 2 clock cycles, and give up to the first cog access every clock cycle.

All of this is ok, but I'm not really seeing any reason to constrain this to 2.4.8.16

A priority encoder (http://en.wikipedia.org/wiki/Priority_encoder) is trivial, and easily fast enough.
A priority encoder is identical to your case, when the user chooses 2.4.8.16 and a superset in all other cases.
ie it totally includes the Power of 2 coverage.

The PLL has been fixed to no longer be limited to powers of 2 - following that lead, when giving control, why not make it LSB granular ?

Seairth · 2014-05-05 21:10

jmg wrote: »

All of this is ok, but I'm not really seeing any reason to constrain this to 2.4.8.16

A priority encoder (http://en.wikipedia.org/wiki/Priority_encoder) is trivial, and easily fast enough.
A priority encoder is identical to your case, when the user chooses 2.4.8.16 and a superset in all other cases.
ie it totally includes the Power of 2 coverage.

The PLL has been fixed to no longer be limited to powers of 2 - following that lead, when giving control, why not make it LSB granular ?

Because, like the current P1+ design, I can write/use a driver that is optimized for 16 clock hub timing on any cog and in any combination of cogs without any change in behavior. Just like the current P1+.

With the priority encoder, if I understand correctly, in order to use that driver, I must force at least one of the active cogs to be #16. And now, all other cogs only get 16 cycle hub timing.

Whereas, with the above approach, you could have up to 7 cogs that can use 8 clock hub timing and the 8th cog would still safely run that driver without any change in its timing or behavior.

The obvious trade-off here is that you would be limited to the first 8 cogs (still, in any combination) in order to get the 8 clock hub timing. All of these schemes have trade-offs. They're just different. Not necessarily any better or worse.

jmg · 2014-05-05 22:11

Seairth wrote: »

With the priority encoder, if I understand correctly, in order to use that driver, I must force at least one of the active cogs to be #16. And now, all other cogs only get 16 cycle hub timing.

Not quite, ALL binary cases are covered using Priority encoder, as well as additional cases of 3,5,6,7,9,10,11,12,13,14,15.

If you want a 8 clock model, just make sure you use COG #8, and avoid COG9..16 => Identical operation to Binary.
( Binary already has avoid-rules on hub placement )

potatohead · 2014-05-05 22:26

I'm still reading, but do not apologize for length. You attempted and succeeded at great communication.

IMHO, nobody should ever apologize for that.

Cluso99 · 2014-05-05 23:44

The problem that I see in any change in the number of slots available in the round-robbin loop, is that Objects designed for 1:16 can fail if they do not get 1:16 slots. It does not matter if it is 1:4, 1:8, 1:9, 1:15 or anything else other than 1:16 (or multiples like 2:32 but spaced at 1:16) may fail. The programmer is relying on that specific 1:16 case!

For instance, 1:8 gives 2:16. But the programmer may have carefully crafted his code to know that each hub occurs at precisely 1:16. He does not need to insert nops if he hasn't any useful instructions. Now if that hubop occurs early (as in 1:8 or anything else for that matter) then his timing is thrown out. He may be using this to sample incoming bits (I have done this, but it's not in obex). Now it fails.

So, there are indeed cases where we have to be careful to preserve the 1:16 format.

I am just wanting you to understand the ramifications of not maintaining the 1:16. I don't disagree that some simple form of superset cannot be used beneficially, just that it's not going to be used by many. It is however possible that a high volume user would not care - he is just interested in getting his project working, and he is likely to understand the ramifications and deal with them accordingly.

jmg · 2014-05-05 23:54

Cluso99 wrote: »

So, there are indeed cases where we have to be careful to preserve the 1:16 format.

I am just wanting you to understand the ramifications of not maintaining the 1:16.

Of course, in either power of 2, or Priority Encoder scan options, the 1:16 would be the default.
I implemented my test design with a separate boolean for this, so users do not even have to know the feature exists.

Something that relies by design on a fixed slot rate, likely also has strict fSYS, and both those important details would, of course, be documented in the header

Other designs, such as those that used WAITxx, or are simply data-rate limited, could specify some minimum Slot rate, and fSys/32 is looking like a useful minimum.

Seairth · 2014-05-06 04:47

jmg wrote: »

Not quite, ALL binary cases are covered using Priority encoder, as well as additional cases of 3,5,6,7,9,10,11,12,13,14,15.

If you want a 8 clock model, just make sure you use COG #8, and avoid COG9..16 => Identical operation to Binary.
( Binary already has avoid-rules on hub placement )

Ahh. Yes, I get what you are saying. My apologies. Using your approach, my 16 clock hub timed driver would still work as long as the top cog used is a power-of-two.

So, the basic differences between your approach and mine are:

TopUsedCog

TopUsedCog is more granular (i.e. hub timings from 1 clock cycle up to 16 clock cycles)
TopUsedCog requires you to explicitly set the top cog to control the timing.
Starting cog #16 will result in the same timing behavior for all cogs as the current P1+ design.
If using an OBEX driver that is tuned for 16 clock timing, then the developer must make sure the top cog is 2^n.
If you need to ensure consistent timing, regardless of which cogs are used, TopUsedCog would need an enable/disable bit.

Power-of-two

2^n is less granular (i.e. clock timing is always a power-of-two)
2^n requires that you explicitly set the top cog to be within one of 5 ranges (1, 2, 3-4, 5-8, 9-16) in order to set timing.
Starting at least one cog from 9 to 16 will result in the same timing behavior for all cogs as the current P1+.
If using an OBEX driver that is tuned for 16 clock timing, then it will work at all times.
If you need to ensure consistent timing, regardless of which cogs are used, design for 16 clock timing.

RossH · 2014-05-06 05:17

A problem common to all these schemes is that they assume that users always use cogs starting from zero, and increasingly monotonically. But this is simply not the case. I have some programs that use specific cogs (cog 0 or cog 7 usually - at least on the P1) and other programs that start and stop cogs dynamically, or in response to external events - so although they may only ever use (say) 5 cogs overall, they do not always use cogs 0,1,2,3,4 - they may just as easily be using 2,4,5,6,7.

And just as there are programs that use sequential cogs to minimize latency, it is reasonable to think some programs might want to use cogs on opposite sides of the hub to minimize jitter (e.g. on the P8X32A it would be cogs 0 & 4 or 3 & 7, whereas on the P16X32B it would be cogs 0 & 8 or 7 & 15).

These cases would further complicate both the TopUsedCog and Power-of-two schemes, unless there was an additional level of indirection added.

Ross.

Brian Fairchild · 2014-05-06 05:21

RossH wrote: »

I have some programs that use specific cogs...

And isn't that going to become more common with the P1+? IIRC there will be some close-coupling of cores to pins for certain functions so as to tie a core to a given set of pins. So it may well be that, to simplify board layout, cores are used in an 'odd' order.

RossH · 2014-05-06 05:27

Brian Fairchild wrote: »

And isn't that going to become more common with the P1+? IIRC there will be some close-coupling of cores to pins for certain functions so as to tie a core to a given set of pins. So it may well be that, to simplify board layout, cores are used in an 'odd' order.

Good point!

Seairth · 2014-05-06 05:50

Yes, I stated all of that in the initial post (see "how does this compare..."). This approach (I'm not speaking for TopUsedCog; go talk about that one elsewhere) has limited utility when you must use specific cogs. Other approaches would certainly be better suited. But, as none of these approaches are implemented in the P1+ right now, I think that excluding this approach because some applications wouldn't be able to take advantage of it is "throwing the baby out with the bath water". I am not arguing that this approach is superior to any of the others (quite the opposite, actually), but it might be the most practical at this point in time. Or not. I leave that to Chip to decide.

kwinn · 2014-05-06 16:26

This is where a lookup table based assignment would work well. You could assign a cog to a specific time slot in the table. Could have cog1 and cog2 assigned to hub slot 1 and 8 or 3 and 10 or..... for jitter free timing. If cogs were not used their slots could be assigned to cogs that are in use, or only counting down from/up to the last cog in use and thereby distributing those slots between the cogs in use.

RossH wrote: »

A problem common to all these schemes is that they assume that users always use cogs starting from zero, and increasingly monotonically. But this is simply not the case. I have some programs that use specific cogs (cog 0 or cog 7 usually - at least on the P1) and other programs that start and stop cogs dynamically, or in response to external events - so although they may only ever use (say) 5 cogs overall, they do not always use cogs 0,1,2,3,4 - they may just as easily be using 2,4,5,6,7.

And just as there are programs that use sequential cogs to minimize latency, it is reasonable to think some programs might want to use cogs on opposite sides of the hub to minimize jitter (e.g. on the P8X32A it would be cogs 0 & 4 or 3 & 7, whereas on the P16X32B it would be cogs 0 & 8 or 7 & 15).

These cases would further complicate both the TopUsedCog and Power-of-two schemes, unless there was an additional level of indirection added.

Ross.

jmg · 2014-05-06 17:07

kwinn wrote: »

This is where a lookup table based assignment would work well. You could assign a cog to a specific time slot in the table. Could have cog1 and cog2 assigned to hub slot 1 and 8 or 3 and 10 or..... for jitter free timing. If cogs were not used their slots could be assigned to cogs that are in use, or only counting down from/up to the last cog in use and thereby distributing those slots between the cogs in use.

True, I'll copy some of my post from the other thread, that relates to this.

A lookup table also needs a 5b Reload field, which I think you are saying above too.

A 32x Table can load in an atomic way, using a WRQUAD opcode equivalent.
Nibble level atomic access may also be needed, but that is not complex.

Another key advantage of an (atomic access WRQUAD) table of 4b fields, is any CogID can go anywhere.
This solves/avoids the all-COGS-are-not-actually-quite-equal issues some have raised.

I am looking at a good some planning and care use case right now, of PAL Video and USB time domain budgets.

With a Locked 1:16 I get to about the 5th line, and it pretty much drops dead, without complex gymnastics.

Key point: HUB Slot Scan Rate (SSR) is not just about Bandwitdh is also impacts tight code-loop granularity.

With a COG map, Table+Reload, I can get many lines further.

eg I have a MOD 28 scanner, a PAL fSYS at 38x Burst, (1.684775125e8) and I have a 12MHz USB Data Sync DPLL that can lock +/- 0.285 % with a centre point that is ~ 15ppm off.

I have 2 'fast' COGS running @ 20ns SRR & 14 'slow' cogs with @ 140ns SSR, both jitter free.

(the fast and slow refer only to the SSR, in all other aspects COGS are of course identical)

The Fast COG SSR, can co-operate with my DPLL in a way that is impossible with a locked 1:16 rate.
Other SW issues may occur, but the timing budget moves to 'possible' on this simple example.

Some of the many possible operational Table+Reload choices are

COGS No, @ Slot Sample Rate, some coverage options, Any-COGid allowed
 2 @ 20ns SRR(fast) & 14 @ 160ns SSR(slow) 
[B] 2 @ 20ns SRR(fast) & 14 @ 140ns SSR(slow) [/B]
 3 @ 30ns SRR(fast) & 12 @ 120ns SSR(slow) 
 4 @ 40ns SRR(fast) & 12 @ 120ns SSR(slow)
 5 @ 50ns SRR(fast) & 10 @ 100ns SSR(slow)
 2 @ 20ns SRR(fast) & 10 @ 100ns SSR(slow)
 reference Default  is 16 @ 80ns SSR (slow)

Unallocated COGs are not in the Hub Scan, but they can still be used for other tasks.

kwinn · 2014-05-06 18:45

There is no doubt in my mind that a look up table with a programmable reload counter is the simplest and most versatile method of assigning hub slots to cogs. I wonder though if it would be worth doubling the size of the look up table to give a wider range of time slots. There are applications that do need hub access, but at a fairly low rate. PWM is one that comes to mind. The frequencies are often in the low KHz range or less, and changes at even lower rates. One access out of 64 would be more than adequate for that.

jmg · 2014-05-06 21:18

kwinn wrote: »

There is no doubt in my mind that a look up table with a programmable reload counter is the simplest and most versatile method of assigning hub slots to cogs.

Agreed.

kwinn wrote: »

I wonder though if it would be worth doubling the size of the look up table to give a wider range of time slots. There are applications that do need hub access, but at a fairly low rate. PWM is one that comes to mind. The frequencies are often in the low KHz range or less, and changes at even lower rates. One access out of 64 would be more than adequate for that.

There is no 'hard' ceiling to this, but one appeal of 32x is a WRQUAD can load/change it in one atomic line.
I think 'another address alias' may also be needed to allow atomic nibble level changes - for example, I could see cases where two clusters of 8 Cogs, do some fancy run-time Alloc sharing within each group, and you want safe atomic changes for that to work free of interactions.

I guess if Table2 is added for Cluso's mooch (pipelined), it could alternately make Table1 larger ?

Hub timing using a power-of-two scheme

Comments