HUB RAM interface question - Page 7 — Parallax Forums


  • @TonyB_ said:

    @evanh said:
    Uh-oh, umm, FIFO writes to hubRAM are not looking friendly at all. My guess is delayed writes aren't happening. I presume they're allowed but don't happen in practice. Measurements for BYTE lines are twice as bad as SHORT lines, which are in turn twice as bad as LONG lines. We're reaching up to 2 million ticks to SETQ2+WRLONG block write 64 kLW while streamer is also writing - for index 16 of all! That's 32 ticks per longword for the block write.

    PS: No change if the block write is changed to a block read with SETQ2+RDLONG.

    Conceptually, from the hub RAM's viewpoint, is there any difference between fast block read and FIFO read? If the answer is yes, why is fast block read worse than FIFO read? If no, why is FIFO write worse than fast block write?

    I think there was an effort to minimise the amount of dirty data held in the write fifo. In order to do this, it has to write every byte as it comes in. Also, the fifo can't assume that the program will write whole longs. And it may not have started on an aligned address. There is no fifo flush command, only an option when starting a new wrfast/rdfast transfer.

    For my ntsc input program, I experimented with oversampling. It didn't go well, even though writing bytes at twice the rate of longs is half the data rate. The normal rate is about 1/22.8, with slight variation as it locks onto the input signal. It might have taken 4x rate to cause problems. I came to the conclusion that there was no advantage to doing a byte wrfast compared to a long wrfast. It would mean a small savings when I read the data back in with a block read (16 longs).

    What I don't understand is why it takes longer when writing bytes or words. I would expect that fifo writes place the same load on the hub regardless of the size.

  • evanh Posts: 16,281
    edited 2022-02-22 05:34

    @SaucySoliton said:
    I think there was an effort to minimise the amount of dirty data held in the write fifo. In order to do this, it has to write every byte as it comes in. Also, the fifo can't assume that the program will write whole longs. And it may not have started on an aligned address. There is no fifo flush command, only an option when starting a new wrfast/rdfast transfer.

    Right, just issue a fresh WRFAST #0,#0. It doesn't complete until prior data is written. That's well understood.

    What I don't understand is why it takes longer when writing bytes or words. I would expect that fifo writes place the same load on the hub regardless of the size.

    It looks like it's as you've said above, writes are written ASAP. Therefore WFBYTEs go straight through as bytes. WFWORDs possibly have a little more complicated buffering so they can be written as aligned shortwords. And WFLONGs likewise would need to be aligned.

    The one big plus of this approach, and I'm sure it's why Chip has done it this way, is that subsequent hubRAM reads will be up-to-date. It gives certainty that dirty data doesn't grow old waiting for the six longwords to fill, as it could if such extra buffering was implemented.

    It does make sense. Just a rather large trade-off. The FIFO is a bus hog when writing to hubRAM.

  • evanh Posts: 16,281

    If FIFO write buffering was implemented it'd need a timer to ensure the buffer auto-flushed when not filling quickly.

    Maybe that's where the oddities are with FIFO reading from hubRAM too. Chip has tried to balance freshness with efficiency.

  • evanh Posts: 16,281
    edited 2022-02-22 06:30

    There are two simple ways of timing out:

    • One is to reset the timeout on every new WFxxxx. This would have a lower trip count, like 16 clock cycles.
    • The other is for the timeout to start at the first WFxxxx into an empty/flushed buffer. The timer ignores further input. This would have a higher trip count, like 128 clock cycles.

    In both cases, a burst write (flush) to hubRAM occurs if the buffer fills to its high level or the timeout trips.
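
    The two timeout schemes above can be sketched as a small Python model. This is only an illustration of the idea, not a description of any real hardware: the function name, the policy labels "reset" and "first", and the high level of 6 items are all assumptions made for the example.

```python
def flush_times(write_cycles, policy, trip, high_level=6):
    """Return the clock cycles at which a buffered FIFO would flush.

    write_cycles: ascending clock cycles at which WFxxxx items arrive.
    policy: "reset" restarts the timer on every write (hypothetical label);
            "first" starts it only on the first write into an empty buffer.
    trip: timeout in clock cycles.
    high_level: item count that forces a flush (assumed value).
    """
    flushes = []
    count = 0        # items currently buffered
    deadline = None  # cycle at which the timeout trips
    for t in write_cycles:
        # A timeout that tripped at or before this write flushes first.
        if count and deadline <= t:
            flushes.append(deadline)
            count = 0
        if count == 0 or policy == "reset":
            deadline = t + trip
        count += 1
        if count >= high_level:  # buffer reached its high level
            flushes.append(t)
            count = 0
    if count:                    # timeout after the final write
        flushes.append(deadline)
    return flushes


print(flush_times([0, 100], "reset", 16))   # two sparse writes, short timer
print(flush_times([0, 100], "first", 128))  # same writes, long timer
```

    With two writes 100 cycles apart, the short resetting timer flushes each item on its own, while the long first-write timer lets them go out together.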

  • Based on Evan's test results for FIFO writes, time taken by fast block move of N longs when FIFO is concurrently writing to hub RAM is shown below, where long/word/byte T is time when FIFO writing longs/words/bytes.

    For sysclk divisor D,

    If D < 9, long T = DN, word T = 2DN, byte T = 4DN
    (except byte T = 1.333DN if D = 8.5)

    If D >= 9, then long T = word T = byte T
    T = N/(1 - C/(DB))
    where C = 8 and B = 1, thus
    T = N/(1 - 8/D)
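
    As a rough check, the equations above can be written as a short Python function. This is a sketch only: the function name and argument names are made up for the example, C is fixed at 8 cogs as in the post, and the 1.333 factor for D = 8.5 is copied as given.

```python
COGS = 8  # C in the equations above


def fast_move_ticks(n_longs, divisor, size="long"):
    """Estimated time T for a fast block move of n_longs while the
    FIFO is writing longs/words/bytes at sysclk/divisor."""
    scale = {"long": 1, "word": 2, "byte": 4}[size]
    if divisor >= 9:
        # Same for all sizes: T = N/(1 - C/(DB)) with B = 1
        return n_longs / (1 - COGS / divisor)
    if size == "byte" and divisor == 8.5:
        return 1.333 * divisor * n_longs  # exception noted in the post
    return scale * divisor * n_longs


N = 65536  # 64 kLW, the block size used in Evan's tests
print(round(fast_move_ticks(N, 25)))   # compare with Evan's 96376
print(fast_move_ticks(N, 3, "long"))   # compare with about 196600
print(fast_move_ticks(N, 3, "word"))   # compare with about 393200
print(fast_move_ticks(N, 3, "byte"))   # compare with about 786400
```

    The comparison figures in the comments are Evan's measured ticks for index 50 and index 6, as posted elsewhere in this thread.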

    D >= 9 equation is same as here:

  • TonyB_ Posts: 2,204
    edited 2022-02-23 11:58

    @SaucySoliton said:
    What I don't understand is why it takes longer when writing bytes or words. I would expect that fifo writes place the same load on the hub regardless of the size.

    Assuming I've interpreted Evan's results correctly, writing bytes or words takes longer only above a certain streamer frequency.

    Fast move time when FIFO writing longs at sysclk/2 = 2N, which is excellent and the best possible. However, sysclk/3 = 3N, sysclk/4 = 4N, etc. up to sysclk/9 (double or quadruple these N values for words or bytes). In plain English, as streamer speed decreases fast move time increases, which makes no sense.

    From sysclk/9+, as streamer speed decreases fast move time also decreases and is the same whether FIFO is writing longs, words or bytes.

  • evanh Posts: 16,281

    Yep, calculations come out exactly as I suspected for the slower rates (higher dividers). Each hubRAM write is a single write of byte, shortword or longword as per the streamer's action.

     BYTE   50   051eb852    96376    96376    96376    96376    96376    96376    96376    96376    96376    96376
    SHORT   50   051eb852    96376    96376    96376    96376    96376    96376    96376    96376    96376    96376
     LONG   50   051eb852    96376    96376    96376    96376    96376    96376    96376    96376    96376    96376
    C (number of cogs) = 8
    D (effective divisor) = 100 (BYTE), 50 (SHORT), 25 (LONG)
    Tu (Unimpeded Ticks) = 65536
    Tm (Measured Ticks) = 96376
    Te (Extra ticks) = Tm - Tu    Eg: 96376 - 65536 = 30840
    S (Stalls) = Te / C           Eg: 30840 / 8 = 3855
    Lb (FIFO burst length in longwords) = (C / D) * Tm / Te  =>  Lb = (C / D) * Tm / (Tm - Tu)  =>  Lb = C / (D * (1 - Tu/Tm))
    Lbb = 8 / (100 * (1 - 65536 / 96376)) = 0.25
    Lbs = 8 / ( 50 * (1 - 65536 / 96376)) = 0.5
    Lbl = 8 / ( 25 * (1 - 65536 / 96376)) = 1
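
    For anyone wanting to rerun the arithmetic above, here is a minimal Python sketch of the same formulas. The function and parameter names are made up for the example; the numbers are the ones from the report.

```python
def stalls(measured_ticks, unimpeded_ticks=65536, cogs=8):
    """S = Te / C, where Te = Tm - Tu."""
    return (measured_ticks - unimpeded_ticks) / cogs


def burst_length(measured_ticks, divisor, unimpeded_ticks=65536, cogs=8):
    """Lb = C / (D * (1 - Tu/Tm)), in longwords per FIFO burst."""
    return cogs / (divisor * (1 - unimpeded_ticks / measured_ticks))


tm = 96376  # measured ticks at index 50
print(stalls(tm))
for name, d in (("BYTE", 100), ("SHORT", 50), ("LONG", 25)):
    print(name, round(burst_length(tm, d), 2))
```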
  • evanh Posts: 16,281
    edited 2022-02-23 12:01

    At smaller dividers there are actual bursts greater than one at a time. However, I believe the bursts are as per the streamer write widths: WFBYTE/WFWORD/WFLONG. Eg: If the streamer is doing WFBYTEs then the FIFO's hubRAM write bursts are byte sized too.

    So we get lopsided hogging, eg:

     BYTE    6   2aaaaaaa   786392   786400   786400   786400   786408   786408   786416   786416   786408   786416
    SHORT    6   2aaaaaaa   393192   393200   393200   393200   393200   393200   393200   393208   393208   393200
     LONG    6   2aaaaaaa   196616   196616   196592   196600   196600   196600   196600   196600   196600   196600
    C (number of cogs) = 8
    D (effective divisor) = 12 (BYTE), 6 (SHORT), 3 (LONG)
    Tu (Unimpeded Ticks) = 65536
    Tm (Measured Ticks) = 786400, 393200, 196616
    Te (Extra ticks) = Tm - Tu    Eg: 786400 - 65536 = 720864,  393200 - 65536 = 327664,  196616 - 65536 = 131080
    S (Stalls) = Te / C           Eg: 720864 / 8 = 90108     ,  327664 / 8 = 40958     ,  131080 / 8 = 16385
    Lb (FIFO burst length in longwords) = (C / D) * Tm / Te  =>  Lb = (C / D) * Tm / (Tm - Tu)  =>  Lb = C / (D * (1 - Tu/Tm))
    Lbb = 8 / ( 12 * (1 - 65536 / 786400)) = 0.727
    Lbs = 8 / (  6 * (1 - 65536 / 393200)) = 1.6
    Lbl = 8 / (  3 * (1 - 65536 / 196616)) = 4
  • evanh Posts: 16,281

    Which means each value in the FIFO write buffer also has an associated tag that says what size it is. And it won't be packed. Each WFBYTE takes a longword of the FIFO.

  • Yanomani Posts: 1,524
    edited 2022-02-23 13:46

    IIRC, the only "tags" available are the four "byte write control lanes", which are the same ones the WMLONG instruction makes use of, in order to be able to discriminate/handle which of the four bytes are $00 (if any), in order to avoid overwriting meaningful information that must be kept at the Hub memory.

    IOW, apart from being valuable for WMLONG, the same "write control lanes" are useful to "encapsulate" bytes and words (AKA shorts), though they're not affected by the information they contain ($00 or not). Those items will always occupy a full long, so the same will be true for the "time slot" they take to complete.

  • TonyB_ Posts: 2,204
    edited 2022-02-23 14:41

    @evanh said:
    At smaller dividers there are actual bursts greater than one at a time.

    Yes and the burst length B converges to 1 quite rapidly as sysclk divisor D increases. If cogs = 8, then

    long B = 8/(D-1) or 1
    word B = 8/(D-0.5) or 1
    byte B = 8/(D-0.25) or 1

    As burst cannot be less than one, B = 1 if quotients above < 1. All of long/word/byte B converge to 1 by sysclk/9.
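
    The three burst equations, with the floor of one, can be sketched in Python as below. The function name and the choice to raise an error for divisors at or below the offset are assumptions for the example, not part of the equations.

```python
def burst_items(divisor, size="long", cogs=8):
    """Burst length B in items of the given size, never less than one."""
    offset = {"long": 1.0, "word": 0.5, "byte": 0.25}[size]
    if divisor <= offset:
        raise ValueError("divisor must be larger than the offset")
    return max(1.0, cogs / (divisor - offset))


for size, longs_per_item in (("long", 1.0), ("word", 0.5), ("byte", 0.25)):
    b = burst_items(3, size)
    # Second figure converts items to longwords for comparison.
    print(size, round(b, 3), round(b * longs_per_item, 3))
```

    Converted to longwords, the sysclk/3 values appear to line up with Evan's Lbl = 4, Lbs = 1.6 and Lbb = 0.727 for index 6.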

  • TonyB_ Posts: 2,204
    edited 2022-02-23 23:14

    Re FIFO writing to hub RAM:
    I think the cog tries to write the byte/word/long immediately, but it still has to wait for the correct slice to come around. At faster streamer speeds this means buffering writes and in word mode maybe writing one long, not two words separately, if both words have arrived in time.

    Re FIFO reading from hub RAM:
    The T = 9N results could be explained if the streamer and fast move slices are locked together. Fast write to slice X has to wait one rev because of FIFO read from X. One cycle later, fast write to X+1 has to wait one rev because of FIFO read from X+1, etc., etc., so that every fast write takes nine cycles instead of one. Only solution to the problem could be knowing when it happens and avoiding it, therefore eight tests, one for each possible slice difference, should be done for each streamer frequency.

  • evanh Posts: 16,281

    @TonyB_ said:
    As burst cannot be less than one, B = 1 if quotients above < 1. All of long/word/byte B converge to 1 by sysclk/9.

    The calculation comes out as less because it represents longword bursts, but bursts are in longwords only for WFLONG. WFBYTE bursts are byte wide and WFWORD bursts are shortword wide.

    That's where implementing delayed buffering would be making a change. It would do packing into longwords to improve the burst ratio efficiency, bringing it in line with the FIFO's reading of hubRAM.

  • evanh Posts: 16,281

    @TonyB_ said:
    Re FIFO writing to hub RAM:
    I think the cog tries to write the byte/word/long immediately, but it still has to wait for the correct slice to come around. At faster streamer speeds this means buffering writes and in word mode maybe writing one long, not two words separately, if both words have arrived in time.

    Hmm, yeah, it would have to pack to fit the slot timing. I guess that explains the wild differences between dividers. Some will pack some won't.

  • evanh Posts: 16,281
    edited 2022-02-24 04:42

    @Yanomani said:
    IIRC, the only "tags" available are the four "byte write control lanes", which are the same ones the WMLONG instruction makes use of, in order to be able to discriminate/handle which of the four bytes are $00 (if any), in order to avoid overwriting meaningful information that must be kept at the Hub memory.

    Sure, the FIFO will use the byte controls in the write to hubRAM but that's not the tags it needs for tracking the data within the FIFO. The tags I'm thinking about will be fully automatic and deeply entrenched inside the FIFO.

    EDIT: It may just be one extra (33rd) bit, per buffer stage, that says the content is packed and aligned longwords. Everything else can be unpacked and use some of the other 32 bits to encode the various other cases, namely size.

  • evanh Posts: 16,281

    One new data point: aligned vs unaligned writes make a difference in only one case and, maybe surprisingly, it's only WFBYTE. But it hints at why index 17 is the only index that doesn't fit into either group.

    Report using longword address alignment:

     BYTE   17   0f0f0f10   742688   742656   742600   742584   742808   742776   742744   742720   742688   742656
    SHORT   17   0f0f0f10  1114064  1114080  1114088  1114112   123784  1114016  1114032  1114048  1114064  1114080
     LONG   17   0f0f0f10   557024   557032   557040   557040   557056   557000   557008   557016   557024   557032

    And report using odd hubRAM addresses:

     BYTE   17   0f0f0f10  1114024  1114056  1114072  1113872  1114104  1114072  1114040  1114008  1114024  1114056
    SHORT   17   0f0f0f10  1114064  1114080  1114088  1114088   123784  1114016  1114032  1114048  1114064  1114080
     LONG   17   0f0f0f10   557024   557032   557040   557040   557040   557064   557008   557016   557024   557032
  • TonyB_ Posts: 2,204
    edited 2022-02-24 10:44

    @evanh said:

    @TonyB_ said:
    As burst cannot be less than one, B = 1 if quotients above < 1. All of long/word/byte B converge to 1 by sysclk/9.

    The calculation comes out as less because it represents longword bursts

    No, there are three slightly different equations for B for long/word/byte and they all are < 1 by sysclk/9 (actually long B = 1 at /9).

    That's where implementing delayed buffering would be making a change. It would do packing into longwords to improve the burst ratio efficiency, bringing it in line with the FIFO's reading of hubRAM.

    Yes, FIFO should operate differently in streamer and non-streamer modes.

  • evanh Posts: 16,281
    edited 2022-02-24 11:51

    The calculations I've used aren't any different for writes vs reads. Just that I get less than 1.0 for burst length because single byte writes count as 0.25.

    My testing is only using the streamer for FIFO interactions. When I say things like WFBYTE in this conversation I'm still talking about the streamer's actions. You'll note Chip also refers to the different streamer modes this way.

    The idea of delayed write buffering is something that Chip did not implement. I'm just pointing out how such a system would probably affect FIFO behaviour if implemented in future designs.

    EDIT: I initially assumed delayed write buffering was implemented, but later realised it isn't so.

  • SaucySoliton Posts: 528
    edited 2022-02-25 02:18

    Going on the assumption that each byte write queues separately in the FIFO, that means that we are writing the same hub slice 4 times in a row. It's a terrible recipe for starvation. What's more, the FIFO is going to overflow and lose data.

    For the case wfbyte /4, it takes 25 clocks to write the data from 16 clocks. I wonder if FIFO overflows can be detected using GETPTR. If not, that's scary.

    EDIT: Wouldn't this problem show up in the HyperRAM driver?
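
    One reading of the 25-clock figure above, written out in Python. The assumptions are that the four bytes all land in the same hub long (so the same slice), that each is queued as its own write, and that a given slice comes around once every 8 clocks. The function names are made up for the example.

```python
def clocks_to_arrive(n_writes, divisor):
    """Clocks over which n_writes items arrive at sysclk/divisor."""
    return n_writes * divisor


def clocks_to_drain(n_writes, slice_period=8):
    """Clocks to write n_writes items to one slice, one per rotation.

    The first write takes one clock; each later write waits a full
    rotation of slice_period clocks for the slice to come around.
    """
    return 1 + (n_writes - 1) * slice_period


print(clocks_to_arrive(4, 4))  # four bytes at wfbyte /4
print(clocks_to_drain(4))      # four separate writes to the same slice
```

    Under those assumptions the FIFO drains more slowly than it fills, which is the starvation and overflow worry raised in the post.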

  • evanh Posts: 16,281
    edited 2022-02-25 02:45

    The FIFO must start packing bytes into shortwords once there is enough buffered. And that'll be why there are two distinct FIFO behaviours for writing to hubRAM, with a narrow transition at index 17.

    Eg: As per your sequence, at cycle 16 there is an extra byte buffered. That can be packed with the next byte arriving on an even hubRAM address and written to hubRAM as 16 bits wide instead of 8 bits.

  • cgracey Posts: 14,275

    Wow! You guys did a lot of work on this.

    I'm only on page 3 of 7, so far.

    Yes, it seems there are some unfortunate harmonic relationships between FIFO consumption and SETQ+RDLONG operations.

    This will need to be fixed in silicon to improve timing. Not sure if you guys have come up with a one-size-fits-all solution to this yet. Maybe always reading 8 longs?

    Will read more later tonight...

  • cgracey Posts: 14,275
    edited 2022-02-25 04:51

    Okay. All caught up.

    In FIFO write mode, each FIFO level has 36 bits:

    • 4 bits for byte-write flags
    • 32 bits for the long data going to the hub, which are handled as four bytes

    Any opportunity to write bytes or a word from the FIFO is taken ASAP, causing the FIFO to not pop if there are still 1-3 bytes unfilled. Likewise, the FIFO doesn't push until all four byte slots are filled, making a complete long. There are also timing issues around hub long alignment, where data in the FIFO may be straddling two separate hub longs. So, data is packed in the FIFO.
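
    A 36-bit FIFO level as described above can be modelled roughly like this in Python. It is only a sketch of the data layout: the class and method names are invented, and mapping the low two address bits to byte lanes in little-endian order is an assumption of the example.

```python
from dataclasses import dataclass


@dataclass
class FifoLevel:
    data: int = 0   # 32 bits of long data, handled as four bytes
    flags: int = 0  # 4 byte-write flags, bit n set means lane n is filled

    def put_byte(self, hub_addr, value):
        """Pack one byte into the lane selected by the hub address."""
        lane = hub_addr & 3                   # assumed little-endian lanes
        shift = lane * 8
        self.data &= ~(0xFF << shift) & 0xFFFFFFFF
        self.data |= (value & 0xFF) << shift
        self.flags |= 1 << lane

    def is_full_long(self):
        """True once all four byte slots are filled."""
        return self.flags == 0b1111


level = FifoLevel()
level.put_byte(0x1001, 0xAB)
print(f"{level.flags:04b} {level.data:08x} {level.is_full_long()}")
```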

  • evanh Posts: 16,281
    edited 2022-02-25 05:29

    @cgracey said:
    This will need to be fixed in silicon to improve timing. Not sure if you guys have come up with a one-size-fits-all solution to this yet. Maybe always reading 8 longs?

    The averages indicate the FIFO's minimum burst read of hubRAM can be lower than six longwords. Identifying if that's possible is the first thing on my radar.

  • evanh Posts: 16,281
    edited 2022-02-25 05:55

    If less than six is possible then that's the #1 fix. Ensure the minimum burst is six.
    If it is already minimum six then we need to work out what is happening to make it appear as less.

  • Another possibility: allow blockmove to reorder the longs.

    If the FIFO buffers the writes into longs, that provides some assurance that hub slices busy now won't be busy next time around.

    This would need a lot of simulation to ensure it works well at all divisors.

  • cgracey Posts: 14,275

    @SaucySoliton said:
    Another possibility: allow blockmove to reorder the longs.

    If the FIFO buffers the writes into longs, that provides some assurance that hub slices busy now won't be busy next time around.

    This would need a lot of simulation to ensure it works well at all divisors.

    That's an interesting approach. Seems like it might really speed things up, but could be complicated for SETQ+WRLONG, where out-of-order cog RAM reads would be needed, causing more latencies, maybe ruining the possibility. Cog RAM writes have no latencies, though.

  • @cgracey said:

    @SaucySoliton said:
    Another possibility: allow blockmove to reorder the longs.

    If the FIFO buffers the writes into longs, that provides some assurance that hub slices busy now won't be busy next time around.

    This would need a lot of simulation to ensure it works well at all divisors.

    That's an interesting approach. Seems like it might really speed things up, but could be complicated for SETQ+WRLONG, where out-of-order cog RAM reads would be needed, causing more latencies, maybe ruining the possibility. Cog RAM writes have no latencies, though.

    I think re-ordering block moves is tricky and it's putting the cart before the horse. The FIFO algorithm should be improved to avoid any need to re-order block moves.

  • Out of order reads/writes for block moves for SETQ+WR/RDLONG should be used only if absolutely required, e.g. when the streamer is running simultaneously. Otherwise this could ruin the use of fast block moves for atomic access to structures. But I don't know if out-of-order access to hub memory makes any sense, anyway. Out-of-order processing of FIFO transfers or reads/writes to cog ram should have no impact, though.

  • TonyB_ Posts: 2,204
    edited 2022-02-25 13:09

    @cgracey said:
    This will need to be fixed in silicon to improve timing. Not sure if you guys have come up with a one-size-fits-all solution to this yet. Maybe always reading 8 longs?

    I think the best solution could be:

    FIFO reading from hub RAM:
    read a burst of 8 longs when needed

    FIFO writing to hub RAM:
    if streamer is using FIFO, write a burst of 8 longs when needed
    if streamer is not using FIFO, write long/word/byte ASAP.

    (Bursts could end up > 8 at higher streamer speeds.)

  • evanh Posts: 16,281

    @TonyB_ said:
    if streamer running write a burst of 8 longs when needed
    if streamer not running write long/word/byte ASAP.

    That may not be so easy to determine since the streamer could be running but is not the one using the FIFO.
