Putting all this hub timing stuff into some sort of concrete context...
Seairth
Trying to put some of this in the context of code we are generally familiar with, let's look at FullDuplexSerial (my current favorite example code).
Since we are missing a few instruction timing details, let's assume that the P1+ takes the same time to access the hub as the P2. Therefore:
* Writes complete in a single instruction cycle
* Reads take an additional two clock cycles, so any instruction following one will stall by two clocks (one instruction cycle)
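To make the arithmetic below easy to check, here is a minimal Python sketch of the two rules just listed. The 2-clock instruction cycle and the stall behavior are my assumptions about the P1+, not measured numbers:

    # Assumed (not measured) P1+ rules: ordinary instructions take one 2-clock
    # instruction cycle, hub writes take 2 clocks, and hub reads add a 2-clock
    # stall to the instruction that follows them.

    def interval(is_read, plain_count):
        """Clocks from one hub op's slot to the next hub op's natural issue time."""
        read_stall = 2 if is_read else 0
        return 2 + read_stall + 2 * plain_count

    def stall(interval_clocks, slot_period):
        """Extra clocks spent waiting for the next hub slot under 1:N timing."""
        return -interval_clocks % slot_period   # 0 when the interval lands on a slot

    # e.g. RDLONG -> WRBYTE in the receive code below: 6-clock interval, 2-clock stall at 1:4
    print(interval(True, 1), stall(interval(True, 1), 4))   # -> 6 2

All the cycle counts below fall out of these two functions.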
The hub-critical receive code contains the following bit:
                rdlong  t2,par          'save received byte and inc head
                add     t2,rxbuff
                wrbyte  rxdata,t2
                sub     t2,rxbuff
                add     t2,#1
                and     t2,#$0F
                wrlong  t2,par
Left unchanged, there would be 4 clock cycles between the RDLONG and WRBYTE (the read stall plus the ADD), and 6 clock cycles between the WRBYTE and WRLONG (the SUB, ADD, and AND).
Counting the two clocks of the hub instruction itself, that spaces the hub operations 6 and 8 clocks apart: the first pair would be ideally timed for 1:6 timing, while the second pair would be ideally timed for 1:8 timing.
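Checking those two pairs with the sketch above (same assumptions):

    interval = lambda rd, n: 2 + (2 if rd else 0) + 2 * n   # as defined earlier

    print(interval(True, 1))    # RDLONG -> WRBYTE: 2 + 2 stall + 1 instruction = 6, ideal 1:6
    print(interval(False, 3))   # WRBYTE -> WRLONG: 2 + 3 instructions          = 8, ideal 1:8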
The send block contains the following bit:
transmit        jmpret  txcode,rxcode   'run a chunk of receive code, then return
                mov     t1,par          'check for head <> tail
                add     t1,#2 << 2
                rdlong  t2,t1
                add     t1,#1 << 2
                rdlong  t3,t1
                cmp     t2,t3           wz
        if_z    jmp     #transmit
                add     t3,txbuff       'get byte and inc tail
                rdbyte  txdata,t3
                sub     t3,txbuff
                add     t3,#1
                and     t3,#$0F
                wrlong  t3,t1
Here, there are 4 clock cycles between the two RDLONGs. Because a taken "jmp #transmit" just lands back on the JMPRET (which yields to the receive code), the only timing consideration is the non-jumping path, where there would be 8 clock cycles between the second RDLONG and RDBYTE. And then, there would be another 8 clock cycles between RDBYTE and WRLONG.
The ideal timings for these groups would be: 1:6, 1:10, 1:10.
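The same sketch confirms the send-side spacings:

    interval = lambda rd, n: 2 + (2 if rd else 0) + 2 * n   # as defined earlier

    print(interval(True, 1))   # RDLONG -> RDLONG: 2 + 2 stall + ADD          =  6, ideal 1:6
    print(interval(True, 3))   # RDLONG -> RDBYTE: 2 + 2 stall + CMP/JMP/ADD  = 10, ideal 1:10
    print(interval(True, 3))   # RDBYTE -> WRLONG: 2 + 2 stall + SUB/ADD/AND  = 10, ideal 1:10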
(note: I'm ignoring the setup code that also accesses the hub. That bit is a one-time operation that does not require precise timing.)
So... how could the slots be tuned to this? Well, the easiest approach would be to give the cog 1:2 timing (effectively, no hub latency). Obviously, that's heavy-handed for many applications, so...
Though not as ideal, you could go with 1:4 timing. In the receive code, the WRBYTE would add two clock cycles (due to stall) over the 1:2 timing. The send code would end up with the second RDLONG, RDBYTE, and WRLONG each stalling for two clock cycles. This would still be quite manageable, I think.
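A quick check of the 1:4 case with the stall() helper from the first sketch:

    stall = lambda i, n: -i % n   # as defined earlier

    print([stall(i, 4) for i in (6, 8)])        # receive: [2, 0], only the WRBYTE stalls
    print([stall(i, 4) for i in (6, 10, 10)])   # send:    [2, 2, 2], three 2-clock stalls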
Now, if we went with 1:8 timing, you'd still get a stall for the WRBYTE in the receive code, but the overall timing would be no different than 1:4 timing. On the send side, the stalls get a bit worse. The second RDLONG stalls for 2 cycles. The RDBYTE and last WRLONG, however, would stall for 6 cycles each. Fortunately, the next three instructions that follow the above block could easily be moved before the last WRLONG to avoid the stall:
                or      txdata,#$100    'ready byte to transmit
                shl     txdata,#2
                or      txdata,#1
                mov     txbits,#11
                mov     txcnt,cnt
And, in this case, the fourth instruction could be moved before the RDBYTE to reduce that stall to 4 clock cycles. It would look like:
transmit        jmpret  txcode,rxcode   'run a chunk of receive code, then return
                mov     t1,par          'check for head <> tail
                add     t1,#2 << 2
                rdlong  t2,t1
                add     t1,#1 << 2
                rdlong  t3,t1
                cmp     t2,t3           wz
        if_z    jmp     #transmit
                add     t3,txbuff       'get byte and inc tail
                mov     txbits,#11
                rdbyte  txdata,t3
                sub     t3,txbuff
                add     t3,#1
                and     t3,#$0F
                or      txdata,#$100    'ready byte to transmit
                shl     txdata,#2
                or      txdata,#1
                wrlong  t3,t1
                mov     txcnt,cnt
This would, I believe, result in exactly the same overall timing as the 1:4 scenarios (because the RDBYTE stall increases by two clocks, but the WRLONG decreases by two clocks).
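The helper bears that out (all numbers still resting on the assumed P1+ rules). Re-ordering stretches the last two send intervals from 10 and 10 clocks to 12 and 16:

    stall = lambda i, n: -i % n   # as defined earlier

    print([stall(i, 8) for i in (6, 10, 10)])   # original send:   [2, 6, 6] = 14 clocks
    print([stall(i, 8) for i in (6, 12, 16)])   # re-ordered send: [2, 4, 0] =  6 clocks, same total as 1:4's [2, 2, 2]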
Now, let's look at what the original code would do with the default 1:16 timing. Unchanged, the receive code would introduce 18 additional clock cycles of stall, and the send code would introduce 22 clock cycles. You could re-arrange the send code as above, but that would only reduce the stalls to 14 clock cycles.
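And the 1:16 totals, from the same model:

    stall = lambda i, n: -i % n   # as defined earlier

    print(sum(stall(i, 16) for i in (6, 8)))        # receive:          10 + 8     = 18
    print(sum(stall(i, 16) for i in (6, 10, 10)))   # send:             10 + 6 + 6 = 22
    print(sum(stall(i, 16) for i in (6, 12, 16)))   # re-ordered send:  10 + 4 + 0 = 14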
So, in summary:
- At 1:2, no stalls, but the cog also hogs the hub. During the periods where the driver is bit-banging I/O, significant hub bandwidth would go unused.
- At 1:4, a bit less pressure on the hub and minimal stalling, with very little latency transferring data to and from the hub.
- At 1:8, even less pressure on the hub and exactly the same stalling as 1:4 (assuming the re-ordered instructions, as above) or slightly worse (if left as-is), with slightly worse latency transferring data to and from the hub.
- At 1:16, very little pressure on the hub, but significantly more stalling and increased latency transferring data to and from the hub.
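For completeness, the whole comparison can be generated in one go from the same assumed model ("re-ordered" uses the intervals from the rearranged send code):

    stall = lambda i, n: -i % n   # as defined earlier

    receive, send, send_reordered = (6, 8), (6, 10, 10), (6, 12, 16)
    for n in (2, 4, 8, 16):
        rx = sum(stall(i, n) for i in receive)
        tx = sum(stall(i, n) for i in send)
        tx2 = sum(stall(i, n) for i in send_reordered)
        print(f"1:{n:<2}  receive {rx:2} clocks, send {tx:2} clocks (re-ordered {tx2:2})")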
Comments
At the end of the day, there isn't much difference. So what does it mean for the original analysis? Well, for the overall FullDuplexSerial, that's hard to say. I calculated this with an idle receiver, so more realistic clock cycle counts are definitely going to be higher (and therefore, the baud rate will be lower). With so much of the "jmpret" code ping-ponging during the idle periods, it's conceivable that the above blocks of code could have a higher relative influence on the overall rate. Or not.
Even so, using the above blocks of code as representative uses of HUBOPs, I think the original exercise still has value.
Does anyone have another small block of code that would be worth putting through the same analysis?
I have a use case combining PAL Video and USB, where 1:14 makes very good sense, and I can see 1:20 cases appealing to someone using 200MHz clocks and wanting 100ns slot spacings. (There are bound to be many others.)
Reload has a very, very low cost: just 5 bits.
Only one bit in my latest map case...
Depends how you count.
I was meaning 5 bits in total; your design has one bit per table entry - but those are implementation details, and once configured the run-result is the same.