sysclock/5 is a shocker. It's as bad, in absolute terms, as sysclock/2 is!
sysclock/10 has similar relative losses but that's half as bad in absolutes.
And sysclock/6 is in the opposite position as the only case that comes out exceeding expected best case.
For P2 synchronous RMII based Ethernet the transfer rate via streamer will be one of these, so do these need to be specifically tested to see if there are any problems?
@150MHz 2 bits read at sysclk/3 rate, 1 long read at sysclk/48 rate
@200MHz 2 bits read at sysclk/4 rate, 1 long read at sysclk/64 rate
@250MHz 2 bits read at sysclk/5 rate, 1 long read at sysclk/80 rate
@300MHz 2 bits read at sysclk/6 rate, 1 long read at sysclk/96 rate
@350MHz 2 bits read at sysclk/7 rate, 1 long read at sysclk/112 rate
The first case is potentially marginal to keep up with the received data but let's assume it could work.
The last case is a high P2 overclock. The P2 will get hot but might still work.
The 150MHz, 250MHz, 350MHz will all generate non 50% duty cycle clocks, 150MHz is the extreme 33:67 case.
The middle 3 frequencies are sweet spots for both video and PSRAM/HyperRAM applications, especially 250MHz for VGA resolution over HDMI. In networking applications the external memory may come in handy for buffering, or for capture, or for larger IP stacks if we get external code working.
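The divisors in the list above follow from simple arithmetic, sketched here in Python. The sketch assumes the standard 50 MHz RMII reference clock carrying 2 bits per cycle. It also assumes that a clock made by dividing sysclk by an odd number spends floor(divisor / 2) cycles in its shorter phase. The function names are illustrative only.

```python
RMII_REF_CLOCK_MHZ = 50     # RMII reference clock; each cycle carries 2 bits
BITS_PER_RMII_CLOCK = 2
BITS_PER_LONG = 32


def rmii_divisors(sysclk_mhz):
    """Return (dibit divisor, long divisor) for a sysclk given in whole MHz."""
    if sysclk_mhz % RMII_REF_CLOCK_MHZ:
        raise ValueError("sysclk must be a whole multiple of 50 MHz")
    dibit_div = sysclk_mhz // RMII_REF_CLOCK_MHZ
    long_div = dibit_div * (BITS_PER_LONG // BITS_PER_RMII_CLOCK)
    return dibit_div, long_div


def duty_cycle_percent(divisor):
    """Return (short phase %, long phase %), rounded to whole percent.

    Assumption: the shorter phase lasts floor(divisor / 2) sysclk cycles.
    """
    short = divisor // 2
    return (round(100 * short / divisor),
            round(100 * (divisor - short) / divisor))


for mhz in (150, 200, 250, 300, 350):
    dibit_div, long_div = rmii_divisors(mhz)
    print(f"{mhz} MHz: sysclk/{dibit_div} per 2 bits, "
          f"sysclk/{long_div} per long, duty {duty_cycle_percent(dibit_div)}")
```

Under these assumptions the even divisors come out at 50:50 and sysclk/3 gives the 33:67 extreme mentioned above.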
B = FIFO Burst length in clock cycles
C = number of Cogs
D = streamer Divisor, e.g. D = 4 for sysclk/4
N = Number of fast move longs
S = number of Stalls
T = Total time in clock cycles
Assuming each stall lasts one egg beater period (B ≤ 8), then
T = N + SC
S = T / (DB)
T = N + TC/(DB)
T(1 - C/(DB)) = N
T = N / (1 - C/(DB))
@TonyB_ said:
Thanks for the results. Why did Evan use words, not longs, for streamer size? This affects the FIFO.
I know, it's weird. In his test code he tests like this... the sysclk divisor is actually 3 with this number, and with 4-bit transfers it takes another factor of 8 to make a long, so IMO this is really one long at sysclk/24, not sysclk/12.
xfrq long $5555_5555 ' sysclock/12
xmod long DM_4bRFle | $ffff ' 1/8 forever
Comments
The measured tick count is the telltale. The sysclock/2 measurement (streaming 16 bits per clock cycle) climbs dramatically.
I was happy with 16K block size but you changed it to 128K.
Yep, it makes the workings harder for me and the outcomes less round too, but I'd already done a bunch of them before it sank in.
Here's some more for /3, /5, /6 and /10:
EDIT: Updated equations for multiples of divider.
Thanks for the extra tests, Evan. Not studied them fully yet. I anticipated the worst result would be for an odd divisor. I think sysclk/5 is the fastest streamer speed that uses FIFO burst = 6 or 7 and it would be much better if burst = 8.
sysclock/5 is just weird. It should be at least as good as sysclock/4.
For some reason it's not achieving hubRAM bursts of six longwords. It's even worse than two longwords at a time!
I've worked out the equation for total cycle times and I'll post it later.
Measured times:
sysclk/4 and faster as expected for burst = 8
sysclk/5 is 54% more than expected for burst = 8
sysclk/6 as expected for burst = 8
sysclk/8 is 3.5% more than expected for burst = 7
sysclk/10 is 8% more than expected for burst = 6 but as expected for burst = 4
sysclk/16 and slower as expected for burst = 6
What divisors do you use for sysclk/5 and sysclk/10? Did you add 1 to the fraction?
$3333_3333 for both of those. Then I changed the pins data width from 16-bit to 8-bit to get /10.
Just wondering whether $3333_3334 would make any difference.
sysclk/8 is 2.5% slower than theoretical fastest possible with burst = 8, not a major issue. As I said earlier, having a fixed FIFO burst = 8 (for 8 cogs) would give the maximum fast move bandwidth across the board. The hub RAM and FIFO interface should be thought of in terms of egg beater revolutions.
No difference at all. sysclock/5 measures at 252062 ticks with both dividers.
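For reference, the xfrq values quoted in this thread can be reproduced with a little arithmetic, sketched here in Python. The sketch assumes that an NCO value of $8000_0000 means one transfer per sysclk, and that the sysclk/N labels count clocks per streamed long rather than per transfer. Both assumptions are inferred from the labels used in the thread, and the function names are illustrative only.

```python
NCO_ONE_PER_CLOCK = 0x8000_0000  # assumed: one streamer transfer every sysclk


def xfrq_for(long_divisor, bits_per_transfer):
    """NCO value for one long every `long_divisor` clocks, rounded down."""
    transfers_per_long = 32 // bits_per_transfer
    return NCO_ONE_PER_CLOCK * transfers_per_long // long_divisor


def pasm_hex(value):
    """Format a 32-bit value in the $xxxx_xxxx style used in the thread."""
    return "${:04X}_{:04X}".format(value >> 16, value & 0xFFFF)


for long_divisor, width in ((5, 16), (10, 8), (11, 16), (12, 4)):
    value = xfrq_for(long_divisor, width)
    print(f"sysclk/{long_divisor} with {width}-bit transfers: {pasm_hex(value)}")
```

The division rounds down, which gives $3333_3333 for sysclk/5 with 16-bit transfers. Adding 1 gives the $3333_3334 variant tried above.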
Thanks for trying. sysclk/3 is excellent.
I'm more than happy with everything except sysclock/5.
Total time will vary by a small number of cycles if there is a variable WAITX of 0 to 7 between starting streamer and the fast block moves.
Not with my code. I've got that locked using the double WRLONG.
Define
B = Burst length for FIFO in clock cycles
C = Cogs
D = streamer Divisor, e.g. D = 4 for sysclk/4
N = fast move longs
S = Stalls
T = Time in clock cycles
Assuming each stall lasts one egg beater revolution:
T = N + SC
S = T / (DB)
T = N + TC/(DB)
T(1 - C/(DB)) = N
T = N / (1 - C/(DB))
B = C / (D(1 - N/T))
Example 1:
N = 131072, C = 8, D = 64, B = 6,
T = 133861 calculated (133870 measured)
Example 2:
N = 131072, C = 8, D = 6, B = 8,
T = 157286 calculated (157294 measured)
T excludes variable wait above one cycle for first fast move long. There might also be a random phase difference between fast move and streamer hub RAM slices in practice (but not in Evan's tests).
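The equations above are easy to check numerically. This is a minimal Python sketch of the same model (each stall costs one egg beater revolution of C cycles). The function names are illustrative and not taken from anyone's test code.

```python
def total_cycles(n_longs, divisor, burst, cogs=8):
    """T = N / (1 - C/(D*B)): modelled fast move time in clock cycles."""
    stall_fraction = cogs / (divisor * burst)
    if stall_fraction >= 1:
        raise ValueError("model breaks down when C/(D*B) >= 1")
    return n_longs / (1 - stall_fraction)


def implied_burst(n_longs, divisor, measured_cycles, cogs=8):
    """B = C / (D * (1 - N/T)): burst length implied by a measured time."""
    return cogs / (divisor * (1 - n_longs / measured_cycles))


# Example 1 and Example 2 from the post
print(round(total_cycles(131072, 64, 6)))   # 133861
print(round(total_cycles(131072, 6, 8)))    # 157286

# Working backwards from the measured times
print(implied_burst(131072, 64, 133870))
print(implied_burst(131072, 6, 157294))
```

Working backwards from the two measured times gives bursts very close to 6 and 8, which matches the values assumed in the examples.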
For P2 synchronous RMII based Ethernet the transfer rate via streamer will be one of these, so do these need to be specifically tested to see if there are any problems?
The first case is potentially marginal to keep up with the received data but let's assume it could work.
The last case is a high P2 overclock. The P2 will get hot but might still work.
The 150MHz, 250MHz, 350MHz will all generate non 50% duty cycle clocks, 150MHz is the extreme 33:67 case.
The middle 3 frequencies are sweet spots for both video and PSRAM/HyperRAM applications, especially 250MHz for VGA resolution over HDMI. In networking applications the external memory may come in handy for buffering, or for capture, or for larger IP stacks if we get external code working.
Good work, this should be tested against more experimentally tested values to see how it stacks up (or has that already been done?)
Streaming longs at sysclk/48 would add 3% to concurrent fast block move timings.
Hardly an issue then.
Here is a table of measured fast block move timings when streamer is running at various fractions of sysclk, e.g. a block of N longs will take 1.09N cycles to fast move at sysclk/16. (Timings less accurate for small values of N.)
Based on tests done by Evan posted above. Burst is FIFO burst length in cycles for which calculated times match actual times.
The /5 behaviour must be some weird harmonic (egg)beating effect where more HUB window transfer slots are lost due to address hits or something else weird stalling due to the FIFO size (some clock cycle is wasted). Maybe in these cases, 2 egg beater windows are lost to the FIFO, not just the one.
sysclk/5 is strange and hard to work out. If egg beater revolution is F for fast move and S for streamer, then instead of FFFFS ... it is close to FFFSS FFSSS ...
sysclk/10 matches calculations if burst = 4 and stall length = 1 rev (or burst = 8 and stall length = 2 revs, but how is that possible?)
sysclk/8 does not quite match integer burst calculation, for some reason.
It would be useful to have data for /7 and /9 and /12 and perhaps /11.
I just modified evanh's program and got this: seems to be off by 2x?
Edit: Ok I was doing 32 bit transfers. Maybe he was assuming 16bits.
sysclk/7 = 344062
sysclk/9 = 235934
sysclk/11 = 205974
Are these for 128K longs? /7 is terrible. As a check, Evan's /5 = 252062.
I think they need to be scaled by 1/2. I will redo with 16bits instead of 32, and repeat the /5 to match the original....and post here in a few mins. Edit: done
sysclk/5 = 252062
sysclk/7 = 183510
sysclk/9 = 1179598 !!! I redid this twice
sysclk/11 = 180230
As a reference I used this (for sysclk/11):
xfrq long $1745_d174 ' sysclock/11
xmod long DM_16bRF | $ffff
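To put the 16-bit results above on a common scale, this short Python sketch converts the tick counts into clock cycles per long moved. It assumes the block size is 131072 longs (128K), as in Evan's tests. That block size is not confirmed for these runs, although the matching /5 figure suggests it.

```python
N_LONGS = 131072  # assumed block size (128K longs), not confirmed for these runs

# Tick counts quoted above for 16-bit streamer transfers
measured_ticks = {5: 252062, 7: 183510, 9: 1179598, 11: 180230}


def cycles_per_long(ticks, n_longs=N_LONGS):
    """Average clock cycles spent per long of the fast block move."""
    return ticks / n_longs


for divisor, ticks in sorted(measured_ticks.items()):
    print(f"sysclk/{divisor}: {cycles_per_long(ticks):.2f} cycles per long")
```

On that assumption, sysclk/9 works out at almost exactly 9 cycles per long, while the other three sit between about 1.4 and 1.9.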
Thanks for the results. Why did Evan use words, not longs, for streamer size? This affects the FIFO.
EDIT:
I've been assuming streamer longs. Fast block bandwidth could be half what I thought it was. Bedtime
I know, it's weird. In his test code he tests like this... the sysclk divisor is actually 3 with this number, and with 4-bit transfers it takes another factor of 8 to make a long, so IMO this is really one long at sysclk/24, not sysclk/12.
Holy cow! sysclock/9 exploded!
Yep, 16-bit for most. Sometimes 8-bit for a quick halving.