HUB RAM interface question - Page 2 — Parallax Forums

HUB RAM interface question

Comments

  • evanh Posts: 15,192

    @evanh said:
    I've done /2 and /4 above. The summary being, everything down to sysclock/4 flowed well. Plenty of available hubRAM access.

    The measured ticks are the telltale. The sysclock/2 measurement (streaming at 16 bits per clock cycle) climbs dramatically.

  • @evanh said:

    @evanh said:
    I've done /2 and /4 above. The summary being, everything down to sysclock/4 flowed well. Plenty of available hubRAM access.

    The measured ticks are the telltale. The sysclock/2 measurement (streaming at 16 bits per clock cycle) climbs dramatically.

    I was happy with 16K block size but you changed it to 128K.

  • evanh Posts: 15,192
    edited 2022-02-18 14:53

    Yep, it makes the working harder for me and the results less round too, but I'd already done a bunch of them before that sank in.

    Here's some more for /3, /5, /6 and /10:

    • sysclock/5 is a shocker. It's as bad, in absolute terms, as sysclock/2 is!
    • sysclock/10 has similar relative losses but that's half as bad in absolutes.
    • And sysclock/6 is in the opposite position as the only case that comes out exceeding expected best case.
     (1 + 8 / (3 * 6))  * 131072 = 189326
     (1 + 8 / (3 * 5))  * 131072 = 200977
      sysclock/3  measured ticks = 196622
    
     (1 + 8 / (5 * 6))  * 131072 = 166025
     (1 + 8 / (5 * 2))  * 131072 = 235930
     (1 + 8 / (5 * 1))  * 131072 = 340787
      sysclock/5  measured ticks = 252062
    
     (1 + 8 / (6 * 6))  * 131072 = 160199
     (1 + 8 / (6 * 7))  * 131072 = 156038
      sysclock/6  measured ticks = 157294
    
     (1 + 8 / (10 * 6)) * 131072 = 148548
     (1 + 8 / (10 * 4)) * 131072 = 157286
     (1 + 8 / (10 * 3)) * 131072 = 166025
      sysclock/10 measured ticks = 163846
    

    EDIT: Updated equations for multiples of divider.
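
The estimates in the listing above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic shown in the post; the function and variable names are made up for illustration, and the reading of the second factor as "longwords per FIFO burst" is taken from the later discussion in this thread.

```python
# Sketch of the estimate used in the listing above:
#   ticks = (1 + cogs / (divider * burst)) * longs
# "divider" is the streamer rate (one long every `divider` sysclock ticks) and
# "burst" is the assumed number of longwords fetched per FIFO burst.

BLOCK_LONGS = 131072  # block size used in these tests


def expected_ticks(divider, burst, longs=BLOCK_LONGS, cogs=8):
    """Estimated fast block move time in sysclock ticks."""
    return (1 + cogs / (divider * burst)) * longs


# Measured ticks quoted in the post, keyed by divider.
measured = {3: 196622, 5: 252062, 6: 157294, 10: 163846}

for divider, ticks in measured.items():
    estimates = {b: round(expected_ticks(divider, b)) for b in (1, 2, 4, 6, 7, 8)}
    closest = min(estimates, key=lambda b: abs(estimates[b] - ticks))
    print(f"sysclock/{divider}: measured {ticks}, "
          f"closest estimate {estimates[closest]} at burst {closest}")
```

Only the burst values 1, 2, 4, 6, 7 and 8 are tried here; that list is an arbitrary choice for the sketch, not something the hardware is known to be limited to.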

  • TonyB_ Posts: 2,127
    edited 2022-02-18 12:13

    @evanh said:

    sysclock/5 is a shocker. It's as bad, in absolute terms, as sysclock/2 is!
    sysclock/10 has similar relative losses but that's half as bad in absolutes.
    And sysclock/6 is in the opposite position as the only case that comes out exceeding expected best case.

    Thanks for the extra tests, Evan. Not studied them fully yet. I anticipated the worst result would be for an odd divisor. I think sysclk/5 is the fastest streamer speed that uses FIFO burst = 6 or 7 and it would be much better if burst = 8.

  • evanh Posts: 15,192
    edited 2022-02-18 14:25

    sysclock/5 is just weird. It should be at least as good as sysclock/4.

     (1 + 8 / (4 * 6))  * 131072 = 174763
      sysclock/4  measured ticks = 174774
    

    For some reason it's not achieving hubRAM bursts of six longwords. It's even worse than two longwords at a time!

     (1 + 8 / (5 * 6))  * 131072 = 166025
     (1 + 8 / (5 * 2))  * 131072 = 235930
     (1 + 8 / (5 * 1))  * 131072 = 340787
      sysclock/5  measured ticks = 252062
    
  • TonyB_ Posts: 2,127
    edited 2022-02-18 22:33

    @evanh said:
    sysclock/5 is just weird. It should be at least as good as sysclock/4.

     (1 + 8 / (4 * 6))  * 131072 = 174763
      sysclock/4  measured ticks = 174774
    

    I've worked out the equation for total cycle times and I'll post it later.

    Measured times:
    sysclk/4 and faster as expected for burst = 8
    sysclk/5 is 54% more than expected for burst = 8
    sysclk/6 as expected for burst = 8
    sysclk/8 is 3.5% more than expected for burst = 7
    sysclk/10 is 8% more than expected for burst = 6 but as expected for burst = 4
    sysclk/16 and slower as expected for burst = 6

    What divisors do you use for sysclk/5 and sysclk/10? Did you add 1 to the fraction?

  • evanh Posts: 15,192
    edited 2022-02-18 15:09

    @TonyB_ said:
    What divisors do you use for sysclk/5 and sysclk/10? Did you add 1 to the fraction?

    $3333_3333 for both of those. I then changed the pin data width from 16-bit to 8-bit to get /10.

  • @evanh said:

    @TonyB_ said:
    What divisors do you use for sysclk/5 and sysclk/10? Did you add 1 to the fraction?

    $3333_3333 for both of those. I then changed the pin data width from 16-bit to 8-bit to get /10.

    Just wondering whether $3333_3334 would make any difference.

  • sysclk/8 is 2.5% slower than theoretical fastest possible with burst = 8, not a major issue. As I said earlier, having a fixed FIFO burst = 8 (for 8 cogs) would give the maximum fast move bandwidth across the board. The hub RAM and FIFO interface should be thought of in terms of egg beater revolutions.

  • evanh Posts: 15,192

    No difference at all. sysclock/5 measures at 252062 ticks with both dividers.

  • @evanh said:
    No difference at all. sysclock/5 measures at 252062 ticks with both dividers.

    Thanks for trying. sysclk/3 is excellent.

  • evanh Posts: 15,192

    @TonyB_ said:
    ... The hub RAM and FIFO interface should be thought of in terms of egg beater revolutions.

    I'm more than happy with everything except sysclock/5.

  • @evanh said:

    @TonyB_ said:
    ... The hub RAM and FIFO interface should be thought of in terms of egg beater revolutions.

    I'm more than happy with everything except sysclock/5.

    Total time will vary by a small number of cycles if there is a variable WAITX of 0 to 7 between starting streamer and the fast block moves.

  • evanh Posts: 15,192

    Not with my code. I've got that locked using the double WRLONG.

  • TonyB_ Posts: 2,127
    edited 2022-02-20 19:16

    @TonyB_ said:
    I've worked out the equation for total cycle times and I'll post it later.

    Define

    B = Burst length for FIFO in clock cycles
    C = Cogs
    D = streamer Divisor, e.g. D = 4 for sysclk/4
    N = fast move longs
    S = Stalls
    T = Time in clock cycles

    Assuming each stall lasts one egg beater revolution:

    T = N + S*C
    S = T / (D*B)

    T = N + T*C / (D*B)
    T * (1 - C/(D*B)) = N
    T = N / (1 - C/(D*B))

    B = C / (D * (1 - N/T))

    Example 1:
    N = 131072, C = 8, D = 64, B = 6,
    T = 133861 calculated (133870 measured)

    Example 2:
    N = 131072, C = 8, D = 6, B = 8,
    T = 157286 calculated (157294 measured)

    T excludes variable wait above one cycle for first fast move long. There might also be a random phase difference between fast move and streamer hub RAM slices in practice (but not in Evan's tests).
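
The two examples can be checked numerically. The following is a minimal sketch of the formula as posted, reading C/DB as C/(D*B); the function names are invented for illustration, and the model only makes sense when D*B is larger than C.

```python
def total_cycles(n, d, b, c=8):
    """T = N / (1 - C/(D*B)): total time when each stall costs one revolution."""
    if d * b <= c:
        raise ValueError("model needs D*B > C, otherwise the move never finishes")
    return n / (1 - c / (d * b))


def stall_count(t, d, b):
    """S = T / (D*B): number of stalls during a move lasting t cycles."""
    return t / (d * b)


for d, b, measured in ((64, 6, 133870), (6, 8, 157294)):
    t = total_cycles(131072, d, b)
    s = stall_count(t, d, b)
    print(f"D={d} B={b}: T={t:.0f} calculated ({measured} measured), S={s:.0f} stalls")
```

Substituting S back into T = N + S*C reproduces T, which is a quick way to confirm that the closed form and the two starting equations agree.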

  • rogloh Posts: 5,170
    edited 2022-02-18 23:38

    For P2 synchronous RMII based Ethernet the transfer rate via streamer will be one of these, so do these need to be specifically tested to see if there are any problems?

    • @150MHz 2 bits read at sysclk/3 rate, 1 long read at sysclk/48 rate
    • @200MHz 2 bits read at sysclk/4 rate, 1 long read at sysclk/64 rate
    • @250MHz 2 bits read at sysclk/5 rate, 1 long read at sysclk/80 rate
    • @300MHz 2 bits read at sysclk/6 rate, 1 long read at sysclk/96 rate
    • @350MHz 2 bits read at sysclk/7 rate, 1 long read at sysclk/112 rate

    The first case is potentially marginal to keep up with the received data but let's assume it could work.

    The last case is a high P2 overclock. The P2 will get hot but might still work.

    The 150MHz, 250MHz and 350MHz cases will all generate non-50% duty cycle clocks; 150MHz is the extreme 33:67 case.

    The middle 3 frequencies are sweet spots for both video and PSRAM/HyperRAM applications, especially 250MHz for VGA resolution over HDMI. In networking applications the external memory may come in handy for buffering, or for capture, or for larger IP stacks if we get external code working.
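
The divider list above follows from the 50 MHz RMII reference clock carrying 2 data bits per clock, so one 32-bit long takes 16 clocks. A short sketch of that arithmetic is below. The duty-cycle figure assumes the clock is generated by counting whole sysclk cycles, with the shorter phase lasting floor(d/2) of d cycles; that is an assumption made for illustration, chosen because it matches the 33:67 figure quoted for 150MHz.

```python
RMII_REF_MHZ = 50          # RMII reference clock, 2 data bits per clock
DIBITS_PER_LONG = 32 // 2  # 16 two-bit transfers fill one long


def rmii_dividers(sysclk_mhz):
    """Return (divider per 2-bit transfer, divider per long) for a given sysclk."""
    if sysclk_mhz % RMII_REF_MHZ:
        raise ValueError("sysclk must be a whole multiple of 50 MHz")
    dibit_div = sysclk_mhz // RMII_REF_MHZ
    return dibit_div, dibit_div * DIBITS_PER_LONG


def shorter_phase_percent(divider):
    """Assumed duty cycle of the shorter clock phase for an integer divider."""
    return 100 * (divider // 2) / divider


for mhz in (150, 200, 250, 300, 350):
    dibit_div, long_div = rmii_dividers(mhz)
    print(f"@{mhz}MHz: sysclk/{dibit_div} per 2 bits, sysclk/{long_div} per long, "
          f"shorter phase {shorter_phase_percent(dibit_div):.0f}%")
```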

  • @TonyB_ said:
    Define

    B = FIFO Burst length in clock cycles
    C = number of Cogs
    D = streamer Divisor, e.g. D = 4 for sysclk/4
    N = Number of fast move longs
    S = number of Stalls
    T = Total time in clock cycles

    Assuming each stall lasts one egg beater period (B ≤ 8), then

    T = N + S*C
    S = T / (D*B)

    T = N + T*C / (D*B)
    T * (1 - C/(D*B)) = N
    T = N / (1 - C/(D*B))

    Good work, this should be tested against more experimentally tested values to see how it stacks up (or has that already been done?)

  • TonyB_ Posts: 2,127
    edited 2022-02-20 19:24

    @rogloh said:
    For P2 synchronous RMII based Ethernet the transfer rate via streamer will be one of these, so do these need to be specifically tested to see if there are any problems?

    • @150MHz 2 bits read at sysclk/3 rate, 1 long read at sysclk/48 rate

    Streaming longs at sysclk/48 would add 3% to concurrent fast block move timings.

  • @TonyB_ said:

    @rogloh said:
    For P2 synchronous RMII based Ethernet the transfer rate via streamer will be one of these, so do these need to be specifically tested to see if there are any problems?

    • @150MHz 2 bits read at sysclk/3 rate, 1 long read at sysclk/48 rate

    Streamer running at sysclk/48 would add 3% to concurrent fast block move timings.

    Hardly an issue then.

  • TonyB_ Posts: 2,127
    edited 2022-02-19 00:11

    Here is a table of measured fast block move timings when streamer is running at various fractions of sysclk, e.g. a block of N longs will take 1.09N cycles to fast move at sysclk/16. (Timings less accurate for small values of N.)

    Sysclk  Timing  Burst
    /2      2.0     8
    /3      1.5     8
    /4      1.33    8
    /5      1.92    
    /6      1.2     8
    /8      1.17
    /10     1.25    4
    /16     1.09    6
    /32     1.04    6
    /64     1.02    6
    /128    1.01    6
    /256    1.005   6
    /512    1.003   6
    

    Based on tests done by Evan posted above. Burst is FIFO burst length in cycles for which calculated times match actual times.
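
The Timing column is simply measured ticks divided by the block size. A small sketch that recomputes it for the dividers whose tick counts are quoted earlier on this page (the dictionary below copies those figures; the function name is illustrative):

```python
BLOCK_LONGS = 131072  # longs moved in each test

# Measured ticks quoted earlier in the thread, keyed by streamer divider.
measured_ticks = {3: 196622, 4: 174774, 5: 252062, 6: 157294, 10: 163846}


def timing_factor(ticks, longs=BLOCK_LONGS):
    """Cycles per long for the fast block move while the streamer is running."""
    return ticks / longs


for divider, ticks in sorted(measured_ticks.items()):
    print(f"/{divider:<3} {timing_factor(ticks):.2f}")
```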

  • rogloh Posts: 5,170
    edited 2022-02-19 00:06

    The /5 behaviour must be some weird harmonic (egg)beating effect where more HUB window transfer slots are lost due to address hits, or some other odd stalling related to the FIFO size (some clock cycle is wasted). Maybe in these cases two egg beater windows are lost to the FIFO, not just one.

  • TonyB_ Posts: 2,127
    edited 2022-02-19 00:30

    @rogloh said:
    The /5 behaviour must be some weird harmonic (egg)beating effect where more HUB window transfer slots are lost due to address hits, or some other odd stalling related to the FIFO size (some clock cycle is wasted). Maybe in these cases two egg beater windows are lost to the FIFO, not just one.

    sysclk/5 is strange and hard to work out. If egg beater revolution is F for fast move and S for streamer, then instead of FFFFS ... it is close to FFFSS FFSSS ...

    sysclk/10 matches calculations if burst = 4 and stall length = 1 rev (or 8 and 2 but how is that possible?)

    sysclk/8 does not quite match integer burst calculation, for some reason.

    It would be useful to have data for /7 and /9 and /12 and perhaps /11.

  • rogloh Posts: 5,170
    edited 2022-02-19 00:43

    I just modified evanh's program and got this: seems to be off by 2x?

    Edit: Ok I was doing 32 bit transfers. Maybe he was assuming 16bits.

    sysclk/7 = 344062
    sysclk/9 = 235934
    sysclk/11 = 205974

  • @rogloh said:
    I just modified evanh's program and got this, doing the other ones will update...
    sysclk/7 = 344062
    sysclk/9 = 235934
    sysclk/11 = 205974

    Are these for 128K longs? /7 is terrible. As a check, Evan's /5 = 252062.

  • rogloh Posts: 5,170
    edited 2022-02-19 00:57

    I think they need to be scaled by 1/2. I will redo with 16bits instead of 32, and repeat the /5 to match the original....and post here in a few mins. Edit: done
    sysclk/5 = 252062
    sysclk/7 = 183510
    sysclk/9 = 1179598 !!! I redid this twice
    sysclk/11 = 180230

    As a reference I used this (for sysclk/11):
    xfrq long $1745_d174 ' sysclock/11
    xmod long DM_16bRF | $ffff
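
The xfrq values quoted in this thread are consistent with a simple rule, sketched below. It assumes that $8000_0000 corresponds to one streamer transfer per sysclk, and that a "sysclk/D" figure counts clocks per 32-bit long, so narrower pin widths need proportionally more transfers. Both are assumptions inferred from the numbers posted here, and `xfrq_for` is a made-up helper, not part of any toolchain.

```python
NCO_ONE_PER_CLOCK = 0x8000_0000  # assumed xfrq value for one transfer per sysclk


def xfrq_for(long_divider, pin_bits):
    """Return an xfrq value giving one 32-bit long every `long_divider` clocks.

    pin_bits is the streamer transfer width (4, 8, 16 or 32 bits).
    The result is rounded down; add 1 to try the rounded-up fraction.
    """
    transfers_per_long = 32 // pin_bits
    return (NCO_ONE_PER_CLOCK * transfers_per_long) // long_divider


for divider, bits in ((5, 16), (10, 8), (11, 16), (12, 4)):
    value = xfrq_for(divider, bits)
    print(f"sysclock/{divider:<2} with {bits:>2}-bit transfers: "
          f"${value >> 16:04x}_{value & 0xFFFF:04x}")
```

Under these assumptions the sysclock/5 and sysclock/10 cases give the same value, which matches the report that only the pin data width was changed between those two tests.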

  • TonyB_ Posts: 2,127
    edited 2022-02-19 01:10

    @rogloh said:
    I think they need to be scaled by 1/2. I will redo with 16bits instead of 32, and repeat the /5 to match the original....and post here in a few mins. Edit: done
    sysclk/5 = 252062
    sysclk/7 = 183510
    sysclk/9 = 1179598 !!! I redid this twice
    sysclk/11 = 180230

    As a reference I used this (for sysclk/11):
    xfrq long $1745_d174 ' sysclock/11
    xmod long DM_16bRF | $ffff

    Thanks for the results. Why did Evan use words, not longs, for streamer size? This affects the FIFO.

    EDIT:
    I've been assuming streamer longs. Fast block bandwidth could be half what I thought it was. Bedtime

  • @TonyB_ said:
    Thanks for the results. Why did Evan use words, not longs, for streamer size? This affects the FIFO.

    I know, it's weird. In his test code he tests like this. The sysclk divisor is actually 3 with this number, and with 4-bit transfers it would take another factor of 8 to make a long, so IMO this is really one long every sysclk/24 clocks, not sysclk/12.

    xfrq    long    $5555_5555      ' sysclock/12
    xmod    long    DM_4bRFle | $ffff   ' 1/8 forever
    
  • evanh Posts: 15,192

    Holy cow! sysclock/9 exploded!

      unimpeded sysclock ticks = 131072
      sysclock/2  measured ticks = 262166
      sysclock/3  measured ticks = 196622
      sysclock/4  measured ticks = 174774
      sysclock/5  measured ticks = 252062
      sysclock/6  measured ticks = 157294
      sysclock/7  measured ticks = 183502
      sysclock/8  measured ticks = 153454
      sysclock/9  measured ticks = 1179598
      sysclock/10 measured ticks = 163846
      sysclock/11 measured ticks = 180230
      sysclock/12 measured ticks = 149798
      sysclock/13 measured ticks = 154910
      sysclock/14 measured ticks = 144870
      sysclock/15 measured ticks = 151246
      sysclock/16 measured ticks = 142998
      sysclock/17 measured ticks = 142238
      sysclock/18 measured ticks = 141566
      sysclock/19 measured ticks = 155654
      sysclock/20 measured ticks = 140438
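
One way to read the list above is to rearrange TonyB_'s formula, B = C / (D * (1 - N/T)), and compute the burst length each measurement would imply. This is a sketch only: it inherits that model's assumption that every stall costs exactly one egg beater revolution, so a non-integer result suggests the assumption does not hold for that divider. The names are illustrative.

```python
N_LONGS = 131072  # unimpeded block move time in ticks (one long per tick)
COGS = 8

# Measured ticks from the list above, keyed by streamer divider.
measured = {
    2: 262166, 3: 196622, 4: 174774, 5: 252062, 6: 157294,
    7: 183502, 8: 153454, 9: 1179598, 10: 163846, 11: 180230,
    12: 149798, 13: 154910, 14: 144870, 15: 151246, 16: 142998,
    17: 142238, 18: 141566, 19: 155654, 20: 140438,
}


def implied_burst(divider, ticks, longs=N_LONGS, cogs=COGS):
    """Burst length B that would make T = N / (1 - C/(D*B)) match `ticks`."""
    return cogs / (divider * (1 - longs / ticks))


for divider, ticks in measured.items():
    print(f"sysclock/{divider:<2} implied burst = {implied_burst(divider, ticks):.2f}")
```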
    
  • evanh Posts: 15,192

    @rogloh said:
    I just modified evanh's program and got this: seems to be off by 2x?

    Edit: Ok I was doing 32 bit transfers. Maybe he was assuming 16bits.

    Yep, 16-bit for most. Sometimes 8-bit for a quick halving.
