Putting all this hub timing stuff into some sort of concrete context...
Seairth
Trying to put some of this in the context of code we are generally familiar with, let's look at FullDuplexSerial (my current favorite example code).
Since we are missing a few instruction timing details, let's assume that the P1+ takes the same time to access the hub as the P2. Therefore:
* Writes complete in a single instruction cycle
* Reads take an additional two clock cycles, so any instruction following one will stall by two clocks (one instruction cycle)
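To make the arithmetic below easy to check, here is a minimal Python sketch of the two rules just listed. The 2-clock instruction cycle and the stall behavior are my assumptions about the P1+, not measured numbers:

    # Assumed (not measured) P1+ rules: ordinary instructions take one 2-clock
    # instruction cycle, hub writes take 2 clocks, and hub reads add a 2-clock
    # stall to the instruction that follows them.

    def interval(is_read, plain_count):
        """Clocks from one hub op's slot to the next hub op's natural issue time."""
        read_stall = 2 if is_read else 0
        return 2 + read_stall + 2 * plain_count

    def stall(interval_clocks, slot_period):
        """Extra clocks spent waiting for the next hub slot under 1:N timing."""
        return -interval_clocks % slot_period   # 0 when the interval lands on a slot

    # e.g. RDLONG -> WRBYTE in the receive code below: 6-clock interval, 2-clock stall at 1:4
    print(interval(True, 1), stall(interval(True, 1), 4))   # -> 6 2

All the cycle counts below fall out of these two functions.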
The hub-critical receive code contains the following bit:
                rdlong  t2,par          'save received byte and inc head
                add     t2,rxbuff
                wrbyte  rxdata,t2
                sub     t2,rxbuff
                add     t2,#1
                and     t2,#$0F
                wrlong  t2,par
Left unchanged, there would be 4 clock cycles between the RDLONG and WRBYTE (the read stall plus the ADD), and 6 clock cycles between the WRBYTE and WRLONG (the SUB, ADD, and AND).
Counting the two clocks of the hub instruction itself, that spaces the hub operations 6 and 8 clocks apart: the first pair would be ideally timed for 1:6 timing, while the second pair would be ideally timed for 1:8 timing.
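Checking those two pairs with the sketch above (same assumptions):

    interval = lambda rd, n: 2 + (2 if rd else 0) + 2 * n   # as defined earlier

    print(interval(True, 1))    # RDLONG -> WRBYTE: 2 + 2 stall + 1 instruction = 6, ideal 1:6
    print(interval(False, 3))   # WRBYTE -> WRLONG: 2 + 3 instructions          = 8, ideal 1:8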
The send block contains the following bit:
transmit        jmpret  txcode,rxcode   'run a chunk of receive code, then return
                mov     t1,par          'check for head <> tail
                add     t1,#2 << 2
                rdlong  t2,t1
                add     t1,#1 << 2
                rdlong  t3,t1
                cmp     t2,t3           wz
        if_z    jmp     #transmit
                add     t3,txbuff       'get byte and inc tail
                rdbyte  txdata,t3
                sub     t3,txbuff
                add     t3,#1
                and     t3,#$0F
                wrlong  t3,t1
Here, there are 4 clock cycles between the two RDLONGs. Because a taken "jmp #transmit" just lands back on the JMPRET (which yields to the receive code), the only timing consideration is the non-jumping path, where there would be 8 clock cycles between the second RDLONG and RDBYTE. And then, there would be another 8 clock cycles between RDBYTE and WRLONG.
The ideal timings for these groups would be: 1:6, 1:10, 1:10.
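The same sketch confirms the send-side spacings:

    interval = lambda rd, n: 2 + (2 if rd else 0) + 2 * n   # as defined earlier

    print(interval(True, 1))   # RDLONG -> RDLONG: 2 + 2 stall + ADD          =  6, ideal 1:6
    print(interval(True, 3))   # RDLONG -> RDBYTE: 2 + 2 stall + CMP/JMP/ADD  = 10, ideal 1:10
    print(interval(True, 3))   # RDBYTE -> WRLONG: 2 + 2 stall + SUB/ADD/AND  = 10, ideal 1:10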
(note: I'm ignoring the setup code that also accesses the hub. That bit is a one-time operation that does not require precise timing.)
So... how could the slots be tuned to this? Well, the easiest approach would be to give the cog 1:2 timing (effectively, no hub latency). Obviously, that's heavy-handed for many applications, so...
Though not as ideal, you could go with 1:4 timing. In the receive code, the WRBYTE would add two clock cycles (due to stall) over the 1:2 timing. The send code would end up with the second RDLONG, RDBYTE, and WRLONG each stalling for two clock cycles. This would still be quite manageable, I think.
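A quick check of the 1:4 case with the stall() helper from the first sketch:

    stall = lambda i, n: -i % n   # as defined earlier

    print([stall(i, 4) for i in (6, 8)])        # receive: [2, 0], only the WRBYTE stalls
    print([stall(i, 4) for i in (6, 10, 10)])   # send:    [2, 2, 2], three 2-clock stalls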
Now, if we went with 1:8 timing, you'd still get a stall for the WRBYTE in the receive code, but the overall timing would be no different than 1:4 timing. On the send side, the stalls get a bit worse. The second RDLONG stalls for 2 cycles. The RDBYTE and last WRLONG, however, would stall for 6 cycles each. Fortunately, the next three instructions that follow the above block could easily be moved before the last WRLONG to avoid the stall:
                or      txdata,#$100    'ready byte to transmit
                shl     txdata,#2
                or      txdata,#1
                mov     txbits,#11
                mov     txcnt,cnt
And, in this case, the fourth instruction could be moved before the RDBYTE to reduce that stall to 4 clock cycles. It would look like:
transmit        jmpret  txcode,rxcode   'run a chunk of receive code, then return
                mov     t1,par          'check for head <> tail
                add     t1,#2 << 2
                rdlong  t2,t1
                add     t1,#1 << 2
                rdlong  t3,t1
                cmp     t2,t3           wz
        if_z    jmp     #transmit
                add     t3,txbuff       'get byte and inc tail
                mov     txbits,#11
                rdbyte  txdata,t3
                sub     t3,txbuff
                add     t3,#1
                and     t3,#$0F
                or      txdata,#$100    'ready byte to transmit
                shl     txdata,#2
                or      txdata,#1
                wrlong  t3,t1
                mov     txcnt,cnt
This would, I believe, result in exactly the same overall timing as the 1:4 scenarios (because the RDBYTE stall increases by two clocks, but the WRLONG decreases by two clocks).
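The helper bears that out (all numbers still resting on the assumed P1+ rules). Re-ordering stretches the last two send intervals from 10 and 10 clocks to 12 and 16:

    stall = lambda i, n: -i % n   # as defined earlier

    print([stall(i, 8) for i in (6, 10, 10)])   # original send:   [2, 6, 6] = 14 clocks
    print([stall(i, 8) for i in (6, 12, 16)])   # re-ordered send: [2, 4, 0] =  6 clocks, same total as 1:4's [2, 2, 2]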
Now, let's look at what the original code would do with the default 1:16 timing. Unchanged, the receive code would introduce 18 additional clock cycles of stall, and the send code would introduce 22 clock cycles. You could re-arrange the send code as above, but that would only reduce the stalls to 14 clock cycles.
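And the 1:16 totals, from the same model:

    stall = lambda i, n: -i % n   # as defined earlier

    print(sum(stall(i, 16) for i in (6, 8)))        # receive:          10 + 8     = 18
    print(sum(stall(i, 16) for i in (6, 10, 10)))   # send:             10 + 6 + 6 = 22
    print(sum(stall(i, 16) for i in (6, 12, 16)))   # re-ordered send:  10 + 4 + 0 = 14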
So, in summary:
- At 1:2, no stalls, but the cog also hogs the hub. During the periods where the driver is bit-banging I/O, significant hub bandwidth would go unused.
- At 1:4, a bit less pressure on the hub and minimal stalling, with very little latency transferring data to and from the hub.
- At 1:8, even less pressure on the hub and exactly the same stalling as 1:4 (assuming the re-ordered instructions, as above) or slightly worse (if left as-is), with slightly worse latency transferring data to and from the hub.
- At 1:16, very little pressure on the hub, but significantly more stalling and increased latency transferring data to and from the hub.
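For completeness, the whole comparison can be generated in one go from the same assumed model ("re-ordered" uses the intervals from the rearranged send code):

    stall = lambda i, n: -i % n   # as defined earlier

    receive, send, send_reordered = (6, 8), (6, 10, 10), (6, 12, 16)
    for n in (2, 4, 8, 16):
        rx = sum(stall(i, n) for i in receive)
        tx = sum(stall(i, n) for i in send)
        tx2 = sum(stall(i, n) for i in send_reordered)
        print(f"1:{n:<2}  receive {rx:2} clocks, send {tx:2} clocks (re-ordered {tx2:2})")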
Comments
At the end of the day, there isn't much difference. So what does it mean for the original analysis? Well, for the overall FullDuplexSerial, that's hard to say. I calculated this with an idle receiver, so more realistic clock cycle counts are definitely going to be higher (and therefore, the baud rate will be lower). With so much of the "jmpret" code ping-ponging during the idle periods, it's conceivable that the above blocks of code could have a higher relative influence on the overall rate. Or not.
Even so, using the above blocks of code as representative uses of HUBOPs, I think the original exercise still has value.
Does anyone have another small block of code that would be worth putting through the same analysis?
I have a use case combining PAL Video and USB, where 1:14 makes very good sense, and I can see 1:20 cases appealing to someone using 200MHz clocks and wanting 100ns slot spacings. (There are bound to be many others.)
Reload has a very, very low cost: just 5 bits.
Only one bit in my latest map case...
Depends how you count.
I was meaning 5 bits in total; your design has one bit per table entry - but those are implementation details, and once configured the run-result is the same.