HUB RAM interface question - Page 3 — Parallax Forums



Comments

  • It seems to me that streaming longs at sysclk/2 with FIFO burst = 8 should leave half the hub RAM bandwidth available for fast moves. Egg-beater revolutions would alternate between the FIFO and the fast move. We really need test results when streaming longs.

  • evanh Posts: 15,183
    edited 2022-02-19 01:28

    I used 16-bit at sysclock/1 for same effect.

    It can have an off-by-one difference due to phase shift, that's all. Bandwidth is identical.
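
The claim that longs at sysclk/2 and 16-bit words at sysclk/1 load the hub equally can be checked with simple arithmetic. This is a minimal sketch, assuming the streamer's average hub demand is just (sample width / 32) longs per sample; the function name is hypothetical and nothing here models FIFO bursts or phase.

```python
from fractions import Fraction

def streamer_longs_per_tick(sample_bits: int, divider) -> Fraction:
    """Average hub longs consumed per sysclock tick by a streamer
    moving sample_bits-wide samples at sysclk/divider (assumed model)."""
    return Fraction(sample_bits, 32) / Fraction(divider)

print(streamer_longs_per_tick(32, 2))   # longs at sysclk/2
print(streamer_longs_per_tick(16, 1))   # 16-bit words at sysclk/1
```

Both cases come out at half a long per tick, which is the "bandwidth is identical" point; the off-by-one phase difference is outside what this average can show.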

  • evanh Posts: 15,183
    edited 2022-02-19 01:55

    Here's an update for repeated tests, with a large random delay to phase-shift the streamer bursts. It also no longer hand-codes the NCO divider.

    xmod    long    DM_16bRF | $ffff    ' 1/2 forever
    wlen    long    $1ffff          ' 128 kLW (512 kByte)
    
    _main
            qfrac   #1, #12     ' calculate divider
            rdfast  #0, #0      ' burst length 6, so 12 * 6 = 72 ticks between bursts
            getqx   pa
            setq    pa      ' set NCO divider
            xinit   xmod, #0
    
            mov     bcdlen, #8
            call    #itoh       ' print NCO divider
            call    #putnl
            mov     bcdlen, #10
    
    .loop
            getrnd  pa
            zerox   pa, #14
            waitx   pa      ' random pause to bring out difference from burst coincidence tally
    
            wrlong  pa, #1      ' align first GETCT for zero gap in start time
            getct   ticks
            setq    wlen
            wrlong  #0, #0
            getct   pa
            sub     pa, ticks
    
            call    #itod
            jmp     #.loop
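
The divider calculation and the burst spacing quoted in the listing's comments can be reproduced on a PC. This is a rough sketch, assuming `qfrac #1, #12` followed by `getqx` yields floor(2^32 / 12); that assumption matches the 55555555 printed next to divider 3 in the results later in the thread. The Python `qfrac` function is a hypothetical model, not P2 code, and the figures 12 and 6 are taken straight from the listing's comment.

```python
def qfrac(d: int, s: int) -> int:
    """Assumed model of `qfrac d, s` + `getqx`: floor((d << 32) / s),
    truncated to 32 bits."""
    return ((d << 32) // s) & 0xFFFF_FFFF

TICKS_PER_LONG = 12   # from the comment in the listing
BURST_LONGS = 6       # burst length quoted in the same comment

print(hex(qfrac(1, 12)))             # value the test program prints via itoh
print(TICKS_PER_LONG * BURST_LONGS)  # ticks between FIFO bursts
```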
    
  • @evanh said:
    I used 16-bit at sysclock/1 for same effect.

    It can have an off-by-one difference due to phase shift, that's all. Bandwidth is identical.

    It would be best to double check it to be sure there's not some weirdness going on...

  • evanh Posts: 15,183
    edited 2022-02-19 01:50

    I have. That's why I said it affects the phase. The random start delay just above does the same.

  • rogloh Posts: 5,155
    edited 2022-02-19 01:53

    Sysclk/9 is way worse for 16 bits vs 32 bits. The transfer size is also having an effect somehow.

  • evanh Posts: 15,183
    edited 2022-02-19 01:58

    @rogloh said:
    Sysclk/9 is way worse for 16 bits vs 32 bits. The transfer size is also having an effect somehow.

    32-bit is sysclock/4.5. Bandwidth is doubled. To compensate you'd want to /2 (>>1) on the NCO divider.
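
The ">>1" compensation can be illustrated numerically. This is a sketch under the assumption that the streamer's sample rate is proportional to the NCO value, so relative bandwidth is sample width times NCO value; both helper names are hypothetical.

```python
def halve_nco(nco_value: int) -> int:
    """The '/2 (>>1)' adjustment suggested in the post."""
    return nco_value >> 1

def relative_bandwidth(sample_bits: int, nco_value: int) -> int:
    """Bits moved per unit time, up to a constant factor (assumed model)."""
    return sample_bits * nco_value

nco_div9 = (1 << 32) // 9   # assumed NCO value for the /9 test case

print(relative_bandwidth(16, nco_div9))             # 16-bit, original NCO
print(relative_bandwidth(32, nco_div9))             # 32-bit, same NCO: doubled
print(relative_bandwidth(32, halve_nco(nco_div9)))  # 32-bit, NCO >> 1
```

With the NCO value halved, the 32-bit case lands back on the same figure as the 16-bit case, so the two runs can be compared like for like.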

  • evanh Posts: 15,183
    edited 2022-02-19 02:13

    Hmmm, there are differences ... I wonder if the implied RFBYTE vs RFWORD vs RFLONG do impact available hubRAM bandwidth ...

    PS: Most streamer ops are going to be RFBYTE and WFBYTE.

  • evanh Posts: 15,183
    edited 2022-02-19 04:03

    Exactly the same binary run twice in a row. Bad cases aren't limited to sysclock/5 or /9. EDIT: Uh, block length is 64kLW here.

     BYTE   3   55555555    78648    78648    78648    78640    78640    78640    78640    78640    78640    78648    78640    78640
    SHORT   3   55555555    98304    98304    98304    98304    98304    98304    98304    98304    98304    98304    98304    98304
     LONG   3   55555555    98304    98304    98304    98304    98304    98304    98304    98304    98304    98304    98304    98304
    
     BYTE   3   55555555    78640    78648    78648    78640    78648    78640    78640    78640    78648    78640    78648    78648
    SHORT   3   55555555   589768   589784   589784   589784   589824   589800   589800   589784   589776   589800   589824   589776
     LONG   3   55555555    98304    98304    98312    98312    98304    98304    98304    98304    98304    98312    98304    98304
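
To put the tick counts above in perspective, they can be divided by the block length. This is a minimal sketch, assuming each figure is the elapsed sysclock ticks for one fast block write of 64 kLW (the block length stated in the post); the helper name is hypothetical.

```python
BLOCK_LONGS = 64 * 1024   # "block length is 64kLW" per the post

def ticks_per_long(ticks: int, block_longs: int = BLOCK_LONGS) -> float:
    """Average sysclock ticks spent per long written."""
    return ticks / block_longs

for label, ticks in [("BYTE", 78640), ("LONG", 98304), ("SHORT, bad run", 589824)]:
    print(f"{label:>15}: {ticks_per_long(ticks):.3f} ticks per long")
```

On that reading, 98304 ticks is exactly 1.5 ticks per long and the 589824 case is exactly 9 ticks per long, a sixfold slowdown from the same binary.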
    
  • evanh Posts: 15,183
    edited 2022-02-20 12:28

    Looping test code now
    Update: added random streamer (FIFO) start address to encourage more oddball cases
    Update2: adjust start address for longword granularity
    This one is buggy. Newest source code here - https://forums.parallax.com/discussion/comment/1535803/#Comment_1535803

  • What's so special about 16 bits that makes it so much worse? Somehow varies from run to run as well. :o Something syncing up badly affecting things from then on. Maybe there is something weird about the test setup, but it seemed ok.

  • evanh Posts: 15,183
    edited 2022-02-19 05:17

    With more randomising it's just as bad for longwords too. Only byte-wise seems immune.

    It's notable that sysclock/5 is afflicted the same as /3 and /9 but the severity is milder.
    EDIT: Ah, and I'm seeing it on sysclock/7 too. Just milder still.
    EDIT2: Also for sysclock/6

  • Hopefully nibbles and two bit data won't get affected badly, they are useful for streaming to QSPI and RMII devices.

  • evanh Posts: 15,183

    @rogloh said:
    Hopefully nibbles and two bit data won't get affected badly, they are useful for streaming to QSPI and RMII devices.

    They should be fine since they all use the same implied RFBYTE and WFBYTE.

  • @evanh said:
    They should be fine since they all use the same implied RFBYTE and WFBYTE.

    Famous last words... :smile: They should be but ARE they? I guess they can't be written to RAM until at least a full byte is already filled so I'll take your word for it (for now).

  • evanh Posts: 15,183
    edited 2022-02-19 05:28

    @rogloh said:
    Somehow varies from run to run as well. :o Something syncing up badly affecting things from then on. Maybe there is something weird about the test setup, but it seemed ok.

    Ya, dunno how but it's gotta be purely down to the FIFO's burst length. It ain't consistent.

  • evanh Posts: 15,183
    edited 2022-02-19 05:30

    @rogloh said:

    @evanh said:
    They should be fine since they all use the same implied RFBYTE and WFBYTE.

    Famous last words... :smile: They should be but ARE they? I guess they can't be written to RAM until at least a full byte is already filled so I'll take your word for it (for now).

    I do think it's time for Chip to carefully look at the Verilog. I think we're looking at fixes for future silicon.

  • @evanh said:
    I do think it's time for Chip to carefully look at the Verilog. I think we're looking at fixes for future silicon.

    Hmm... Did I speak too soon about feeling we should be grateful for a good design there? Hopefully nothing bad in there, just some weird harmonic interaction with addresses killing hub throughput maybe, but it would be good to pinpoint an exact case that performs really poorly every time and run it by Chip to see why.

  • evanh Posts: 15,183
    edited 2022-02-19 06:13

    @rogloh said:
    ... but it would be good to pinpoint an exact case that performs really poorly every time and run it by Chip to see why.

    I could log the randomising values used to see if there are common alignments. Might not be obvious at all ... No, there's no way that's enough info. The effect was happening anyway, the random address just brought it out some more.

  • rogloh Posts: 5,155
    edited 2022-02-19 06:21

    It would be good to know that data is reliable when this strange effect is happening. I wonder if a particular test pattern could be streamed out and read back in from the pins, in another COG. It might be a PITA to set up a test for this though.

    It would be bad if some extra bytes were being transferred due to something wrapping around when it shouldn't be. That might give the appearance of extra hub cycle transfers etc.

    If this is a bug, there is a chance it could be corrupting the transfers.

  • evanh Posts: 15,183

    Now you're hyperventilating.

  • LOL. not really.

  • evanh Posts: 15,183
    edited 2022-02-19 13:28

    Writing up the situation for Chip, I noticed that my inner loop doesn't restart the streamer. I saw it as a consistency feature at the time but now I think it'd be better if the streamer was restarted for each test loop ...

    And using the loop count as the FIFO start address creates repeatable changing cases across each line. And, no surprise, using random FIFO start address creates random cases across the line.

    Revised program attached
    Update: Reintroduced reporting of NCO divider

  • @evanh said:
    Exactly the same binary run twice in a row. Bad cases aren't limited to sysclock/5 or /9. EDIT: Uh, block length is 64kLW here.

     BYTE   3   55555555    78648    78648    78648    78640    78640    78640    78640    78640    78640    78648    78640    78640
    SHORT   3   55555555    98304    98304    98304    98304    98304    98304    98304    98304    98304    98304    98304    98304
     LONG   3   55555555    98304    98304    98304    98304    98304    98304    98304    98304    98304    98304    98304    98304
    
     BYTE   3   55555555    78640    78648    78648    78640    78648    78640    78640    78640    78648    78640    78648    78648
    SHORT   3   55555555   589768   589784   589784   589784   589824   589800   589800   589784   589776   589800   589824   589776
     LONG   3   55555555    98304    98304    98312    98312    98304    98304    98304    98304    98304    98312    98304    98304
    

    Please can we keep to BYTE, WORD and LONG?

    Are there some results I could see for various D for sysclk/D long streaming, sysclk/D word streaming & sysclk/D byte streaming?

  • Thanks. Please add names above columns so Chip and the rest of us know what's what and please please change SHORT to WORD. Is this fast move writes and streamer reads? Have you tested fast reads and streamer writes? Or possibly read,read and write,write.

    Presumably LONG sysclk/2 time = eternity?

  • evanh Posts: 15,183

    @TonyB_ said:
    Presumably LONG sysclk/2 time = eternity?

    yes.

    Those are just the emitted reports as is. I've not added any frills. The description is in the prior post to that link.

  • @evanh said:

    @TonyB_ said:
    Presumably LONG sysclk/2 time = eternity?

    yes.

    Those are just the emitted reports as is. I've not added any frills. The description is in the prior post to that link.

    Fast block write and FIFO read, then. What is last column?

    I'm wondering whether fast read and FIFO write would be the same. Also, are 12 tests really needed? Could be only eight, one for each slice difference.

  • evanh Posts: 15,183
    edited 2022-02-19 12:25

    Oh, the last column is new: it's a tally of the results on that line that are >= 80000. Not useful below sysclock/6. It was mainly a way to quickly see something in the mass of larger dividers.

    Twelve is just how many I had from earlier. It might show repeats, I haven't looked.

    EDIT: Actually, it has highlighted sysclock/3 as an exception where there is variation in its BYTE line now.
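
The tally column described in this post is easy to recompute from a report line. This is a small sketch using two rows copied from the results quoted earlier in the thread; the 80000 threshold comes from the post, while the function name is hypothetical.

```python
THRESHOLD = 80_000   # from the post: results >= 80000 are counted

def tally_slow(results, threshold=THRESHOLD):
    """Count how many tick figures on one report line reach the threshold."""
    return sum(1 for ticks in results if ticks >= threshold)

byte_row = [78648, 78648, 78648, 78640, 78640, 78640,
            78640, 78640, 78640, 78648, 78640, 78640]
short_row = [589768, 589784, 589784, 589784, 589824, 589800,
             589800, 589784, 589776, 589800, 589824, 589776]

print(tally_slow(byte_row), tally_slow(short_row))
```

For these sysclock/3 rows the tally is all or nothing, which fits the remark that the column only becomes informative at larger dividers.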

  • TonyB_ Posts: 2,123
    edited 2022-02-19 12:32

    @evanh said:

    @TonyB_ said:
    Presumably LONG sysclk/2 time = eternity?

    yes.

    Does this mean no cog access at all to hub RAM when streaming longs at sysclk/2?
