
HUB RAM interface question

What would happen if I set up the streamer to read from hub RAM at a rate of 1 byte per 16 clock cycles and, while the streamer is running, initiate a fast and long block move with a SETQ+WRLONG combination spanning several hundred longs?

Will this work at all, or is there a conflict between FIFO reads and block writes? If it works, who gets priority? The block move will try to consume all the available bandwidth. The streamer should get priority so that it can continue to run, but I fear this would mess up the egg-beater schedule and bring the bandwidth down from 1 long/cycle to around 1 long per 9 cycles.
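For concreteness, the kind of sequence I have in mind looks roughly like this (untested sketch; the hub addresses are arbitrary, block stands for the first cog register of the data to write, and tx_cmd for an RFBYTE-to-pins streamer command word I haven't worked out yet):

        rdfast  #0, ##$1000         ' point the FIFO at the hub source data, no wrap
        setxfrq ##$0800_0000        ' streamer NCO = sysclock/16 -> 1 byte per 16 clocks
        xinit   tx_cmd, #0          ' start streaming bytes out of the FIFO (mode %1001)

        setq    ##255               ' then, while that runs, a 256-long fast block write
        wrlong  block, ##$8000      ' cog RAM at 'block' -> hub $8000, competing with the FIFO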


Comments

  • TonyB_ Posts: 2,196
    edited 2022-02-17 20:20

    @ManAtWork said:
    What would happen if I set up the streamer to read from hub RAM at a rate of 1 byte per 16 clock cycles and, while the streamer is running, initiate a fast and long block move with a SETQ+WRLONG combination spanning several hundred longs?

    Will this work at all, or is there a conflict between FIFO reads and block writes? If it works, who gets priority? The block move will try to consume all the available bandwidth. The streamer should get priority so that it can continue to run, but I fear this would mess up the egg-beater schedule and bring the bandwidth down from 1 long/cycle to around 1 long per 9 cycles.

    I posed a similar question here, somewhat off-topic:
    https://forums.parallax.com/discussion/comment/1535106/#Comment_1535106

    Yes, it will work and the FIFO has priority. As 1 byte per 16 cycles is 1 long per 64 cycles, the block move should not have to yield to the FIFO very often. I don't know exactly how the FIFO re-fills itself and some testing of this would be very helpful but I can't spare the time at the moment.

    I hope the FIFO operates in bursts of N longs in N cycles, the same rate as fast block moves achieve when there is no yielding. Higher streamer speeds demand this, and I doubt there is a different mode at slower speeds.
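    In numbers, for the rates in question:

        streamer:         1 byte / 16 clocks = 1 long / 64 clocks read from hub
        fast block write: up to 1 long / clock
        -> the FIFO needs only about 1/64 of the bandwidth the block move can use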

  • Ah thanks, yes that makes sense. If the FIFO uses bursts to fill up then there should be enough bandwidth left.

    Does it make a difference if the streamer runs byte or long reads (mode 1001 vs. 1011, for example)? I hope not. I think the FIFO should fill up with RFLONGs no matter how many bytes the streamer pulls out at once.

  • evanh Posts: 16,039
    edited 2022-02-18 03:17

    My understanding is FIFO hubRAM accesses are 32-bit, so always at full bandwidth. The RFxxxx instructions, or equivalent streamer ops, interact with the FIFO as needed for the op.

    Which means there can be late writes to hubRAM. But then there can be way early reads too.

    EDIT: I guess a closing final write can be less than 32-bit.

    EDIT2: Yeah, so with the OP scenario, the FIFO will steal its single hubRAM slot but, as you've likely guessed, that will also disrupt the block write for 8 clock cycles because the block write will be ordered and forced to miss a slot.

    That's an interesting mental exercise. The FIFO is the one that has the most flexibility but it's also the one with priority so that very flexibility isn't being taken advantage of ... unless it does delay until needing a whole rotation of slots ... I don't think it waits that long though.

    EDIT3: I had that wrong. The FIFO pulls in a minimum of six longwords in a burst. Much better than I guessed.

  • TonyB_ Posts: 2,196
    edited 2022-02-17 21:32

    @ManAtWork said:
    Does it make a difference if the streamer runs byte or long reads (mode 1001 vs. 1011, for example)? I hope not. I think the FIFO should fill up with RFLONGs no matter how many bytes the streamer pulls out at once.

    Well, it makes a big difference in terms of hub RAM bandwidth consumed by the streamer for byte vs. long reads, but, as Evan says, FIFO <-> hub RAM moves are always in longs. You should not have a problem with your example.
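    For example, at the same NCO setting of one transfer per 16 clocks:

        mode %1001 (RFBYTE): 1 byte / 16 clocks = 1 long / 64 clocks of hub reads
        mode %1011 (RFLONG): 1 long / 16 clocks = 4x the hub bandwidth of the byte case

    Either way the FIFO itself refills from hub RAM in whole longs.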

  • evanh Posts: 16,039
    edited 2022-02-17 22:59

    Oh, reading the silicon doc:

    The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously whenever less than (cogs+7) stages are filled, after which point, up to 5 more longs may stream in, potentially filling all (cogs+11) stages. These metrics ensure that the FIFO never underflows, under all potential reading scenarios.

    If I'm reading that right, it is actually saying at least 6 longwords at a time. And maybe more if the flow rate is high. That's a nice improvement on the single longs I was thinking it might be.
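    Plugging in cogs = 8 for the P2, those figures work out as:

        FIFO depth       = 8 + 11 = 19 longwords
        refill threshold = 8 + 7  = 15 longwords filled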

    PS: Nothing in the doc about FIFO writes to hubRAM though.

    EDIT: Ah, there is this little hint:

    If a cog has been writing to the hub via WRFAST, and it wants to immediately COGSTOP itself, a 'WAITX #20' should be executed first, in order to allow time for any lingering FIFO data to be written to the hub.

    What that implies is FIFO writes do auto-flush after some time without needing an explicit RDFAST or WRFAST reissued. Which in turn implies delayed writes are the modus operandi.
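    So the safe shutdown pattern from the doc amounts to something like this (sketch only; 'result' and the buffer address are placeholders):

            wrfast  #0, ##$7F000        ' FIFO in hub-write mode, no wrap
            wflong  result              ' lands in the FIFO, not necessarily in hub yet
            waitx   #20                 ' per the doc: give lingering FIFO data time to reach hub
            cogid   pa
            cogstop pa                  ' only now is it safe to stop this cog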

  • @ManAtWork said:
    Ah thanks, yes that makes sense. If the FIFO uses bursts to fill up then there should be enough bandwidth left.

    If you try this could you time how long the fast block write takes?

  • evanh Posts: 16,039

    @TonyB_ said:
    If you try this could you time how long the fast block write takes?

    It has to be writing in ridged address order. Anything else is too complicated. Which implies an eight clock cycle extension on each missed slot ...

  • @ManAtWork said:
    What would happen if I set up the streamer to read from hub RAM at a rate of 1 byte per 16 clock cycles and, while the streamer is running, initiate a fast and long block move with a SETQ+WRLONG combination spanning several hundred longs?

    Will this work at all, or is there a conflict between FIFO reads and block writes? If it works, who gets priority?

    Already replied to by others, but I know this works for real as I do exactly this in my video driver: I prepare the next scan line's worth of data pixels and write it back using large SETQ2 burst writes to HUB while the streamer + FIFO is streaming out the current scan line (rough sketch below). The streamer is not disrupted, so it has to get priority. You can be confident it is worth coding and should work out. I guess this is for trying some single COG Ethernet driver idea...or maybe something related to EtherCAT?

    The block move will try to consume all the available bandwidth. The streamer should get priority so that it can continue to run, but I fear this would mess up the egg-beater schedule and bring the bandwidth down from 1 long/cycle to around 1 long per 9 cycles.

    It doesn't seem to lose as much bandwidth as you'd expect. I don't have specific numbers, but when I did some calculations on pixel throughput performance at different resolutions ages ago, I happily found it seemed to lose less than the 7 longs per streamer access I had feared (which would have totally killed performance). After reading what evanh posted above, that makes sense now if indeed the P2 does its FIFO reads in bursts of more than one long at a time.
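    The per-scan-line shape of it is roughly this (illustrative only; the names and the 640-long line length are made up, and the real driver does a lot more):

            ' streamer + FIFO already busy sending the current scan line out of hub
            ' meanwhile this cog has built the next line's pixels in LUT RAM, then:
            setq2   ##639               ' 640 longs of prepared pixels
            wrlong  0, nextline_ptr     ' one burst write, LUT -> hub; the FIFO's reads keep priority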

  • evanh Posts: 16,039
    edited 2022-02-18 00:03

    Huh, unimportant but I discovered the weirdest thing: forgetting to set up the FIFO, i.e. missing the RDFAST before an XINIT, the XINIT still consumes hubRAM slot timings. I have no idea if valid data is read though.

  • It probably just reads from some random address left in its current FIFO address register. Is the data valid (from some random hub address, so it might look random) or is it zeroes?

  • evanh Posts: 16,039
    xfrq    long    $1000_0000      ' 1/8
    xmod    long    DM_8bRF | $ffff     ' 1/4 forever
    wlen    long    $3fff           ' 16 kLW (64 kByte)
    
    _main
            rdfast  #0, #0      ' burst length 6, so sysclock / (8 * 4 * 6) = 192 ticks between bursts
            setq    xfrq
            xinit   xmod, #0
            waitx   #500        ' make sure FIFO is steady
    
            wrlong  pa, #1      ' align first GETCT for zero gap in time.
            getct   ticks
            setq    wlen
            wrlong  #0, #0
            getct   pa
            sub pa, ticks
            ...
    

    Without the streamer active (commented out), measured time is 16390 ticks (16384 + 6 for instructions).
    With the streamer active, measured time is 17102 ticks. Minus 6 for instructions, that's 17096 - 16384 = 712 extra clock cycles, which is 712 / 8 = 89 stalls during the block write ... that's about 4 more stalls than needed if all FIFO bursts were 6 longwords.
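    Or laid out:

        17102 - 6     = 17096 ticks for the block write with the streamer running
        17096 - 16384 =   712 extra ticks
          712 / 8     =    89 stalls of 8 ticks each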

  • TonyB_ Posts: 2,196
    edited 2022-02-18 00:50

    @evanh said:

    @TonyB_ said:
    If you try this could you time how long the fast block write takes?

    It has to be writing in ridged address order. Anything else is too complicated. Which implies an eight clock cycle extension on each missed slot ...

    What is "ridged address order"?

  • TonyB_ Posts: 2,196
    edited 2022-02-18 01:03

    @evanh said:
    Without the streamer active (commented out), measured time is 16390 ticks (16384 + 6 for instructions).
    With the streamer active, measured time is 17102 ticks. Minus 6 for instructions, that's 17096 - 16384 = 712 extra clock cycles, which is 712 / 8 = 89 stalls during the block write ... that's about 4 more stalls than needed if all FIFO bursts were 6 longwords.

    How many cycles per long for streamer? 32?

  • evanh Posts: 16,039

    Doh! Rigid rather. 'twas spelling it how I pronounce it. :)

  • evanh Posts: 16,039
    edited 2022-02-18 01:09

    @TonyB_ said:
    How many cycles per long for streamer? 32?

    8 ticks / byte -> 32 ticks / longword -> 192 ticks / burst.

    xfrq    long    $1000_0000      ' 1/8
    
  • TonyB_ Posts: 2,196
    edited 2022-02-18 01:15

    @TonyB_ said:

    @evanh said:
    Without the streamer active (commented out), measured time is 16390 ticks (16384 + 6 for instructions).
    With the streamer active, measured time is 17102 ticks. Minus 6 for instructions, that's 17096 - 16384 = 712 extra clock cycles, which is 712 / 8 = 89 stalls during the block write ... that's about 4 more stalls than needed if all FIFO bursts were 6 longwords.

    How many cycles per long for streamer? 32?

    Evan, you need to consider the extra time the stalls have added. 17096/32 cycles = 534.25 streamer longs. 534.25/89 = 6 as near as makes no difference.

  • evanh Posts: 16,039

    @TonyB_ said:
    Evan, you need to consider the extra time the stalls have added. 17096/32 cycles = 534.25 streamer longs. 534.25/89 = 6 as near as makes no difference.

    That's burst time, not stall time.

  • TonyB_ Posts: 2,196
    edited 2022-02-18 12:45

    @evanh said:

    @TonyB_ said:
    Evan, you need to consider the extra time the stalls have added. 17096/32 cycles = 534.25 streamer longs. 534.25/89 = 6 as near as makes no difference.

    That's burst time, not stall time.

    You said

    that's about 4 extra stalls than needed if all FIFO bursts were 6 longwords

    but I think it's the right number of stalls. Bedtime.

    P.S. FIFO burst length should ideally be no more than 8.

  • rogloh Posts: 5,837
    edited 2022-02-18 03:31

    In some ways I sort of feel we should be grateful that Chip thought carefully about this during the design stage; if he hadn't, we could have had a real performance killer with block transfers occurring during FIFO + streamer access. The way it is now, we get to keep the HUB transfer utilization high when block transfers and the streamer FIFO are both being used at the same time, which is totally awesome. We don't seem to lose an entire hub window (or 7/8ths of it) each time the streamer kicks in transferring a single item. If 6 long transfers are made each time the FIFO reads from HUB, yes, we'd lose the hub window for that time, but it now only happens 1/6 as often and potentially only two or three slots are unused at that time.

    One thing this sort of means, though, is that at times the FIFO must actually defer to the SETQ block transfer request up to a point before taking over. Some type of hysteresis/watermark triggering is probably going on here to cause that behaviour; otherwise the FIFO request backlog could simply never build up or underflow (when operating under 1/8th of the total HUB bandwidth) and the FIFO would kick in every time it needed another long. So I think this control logic probably ties in with the rule about filling continuously up to COGS+7 stages. If below this number the FIFO has precedence, but above it, maybe SETQ does to some degree...

  • evanh Posts: 16,039

    Okay, main points:

    • Completion time is when 16384 longwords (64 kBytes) is written. 16384 ticks when uninterrupted.
    • I'm measuring an additional 712 ticks, which is exactly, no error, 89 stalls occurring - on the reasonable assumption that each stall is 8 ticks long.
    • Tony has calculated that 89 six longword bursts would fit quite nicely in 17096 ticks. Which suggests they are all that size. Which is good.

    So, maybe I've got something wrong with my expected 85 stalls ... I worked it out as 16384 += 16384 / 24 -> 17067 ticks. The / 24 comes from 8 / 192, which is from the assumed 8 tick extension for each write stall and 192 ticks between each 6 longword FIFO burst. And 17096 - 17067 = 29 ticks too long. And I did presume those all came from extra stalls.

  • evanh Posts: 16,039
    edited 2022-02-18 03:16

    @rogloh said:
    ... If 6 long transfers are made each time the FIFO reads from HUB, yes, we'd lose the hub window for that time, but it now only happens 1/6 as often and potentially only two or three slots are unused at that time.

    Absolutely, that's a good thing! Chip's done well. I've gone back and corrected my wrong assumption there.

  • evanh Posts: 16,039
    edited 2022-02-18 03:50

    Doh! 17067 / 192 = 88.9. Missed that. So 89 stalls is expected, and the time taken is exactly that. Okay, so the 29 excess ticks is just an error in the ratio from 16384 to 17067 I guess ... yeah, 16384 += 16384 / 23 -> 17096
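    Spelled out:

        17067 / 192 = 88.9  ->  89 FIFO bursts land during the write
        89 * 8      = 712 ticks of stalls
        16384 + 712 = 17096 ticks, matching the measurement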

  • Chip has done well. It's one of those things that might not be immediately apparent during the design stage to regular / inexperienced people who don't consider system-level impacts carefully.

    I recall a time at a networking company in Silicon Valley where we had a separate ASIC team diligently trying to develop a custom chip so that our flagship switch fabric board for a chassis wouldn't require multiple FPGAs, to save costs. We probably spent a couple of million dollars or so on tools and the workers for around a year, including the initial chip fab costs etc. By the time the chip finally arrived it contained a nasty flaw, IIRC because it had flow control without bus speed-up, which limited transfer performance whenever the flow control was active, so it would not work at wire speed for smaller packets. This likely happened because the HW designers on the project, while good at Verilog etc., probably weren't really system people, and that team had worked in isolation from the day-to-day development team, who might have had a chance to see this coming had they known what was being designed and reviewed what was being built. Once the flaw was pointed out to the HW team and the denial passed, they scrambled to come up with a workaround...

    Result: an ASIC AND now 6 new FPGAs were needed on the switch, one new FPGA for a pair of line card slots to store and forward packets and buffer during back pressure times. :tired_face: LOL

  • evanh Posts: 16,039

    It's funny in a way, the FPGA is there to thoroughly test these things before going to ASIC.

  • evanh Posts: 16,039
    edited 2022-02-18 14:56

    So for some reason the actual additional factor is 8 stall ticks / (192 burst interval ticks - 8 stall ticks) ... 16384 += 16384 * 8 / (192 - 8) -> 17096 ticks.

    EDIT: Nah, it's not that simple. Best guess now is a beat forms at that rate. Here's some more:

     (1 + 8 / (2 * 6))  * 131072 = 218453
     (1 + 8 / (2 * 4))  * 131072 = 262144
      sysclock/2  measured ticks = 262166
    
     (1 + 8 / (4 * 6))  * 131072 = 174763
      sysclock/4  measured ticks = 174774
    
     (1 + 8 / (8 * 6))  * 131072 = 152917
     (1 + 8 / (8 * 5))  * 131072 = 157286
      sysclock/8  measured ticks = 153454
    
     (1 + 8 / (16 * 6)) * 131072 = 141995
     (1 + 8 / (16 * 5)) * 131072 = 144179
      sysclock/16 measured ticks = 142998
    
     (1 + 8 / (32 * 6))  * 131072 = 136533
      sysclock/32  measured ticks = 136774
    
     (1 + 8 / (64 * 6))  * 131072 = 133803
      sysclock/64  measured ticks = 133870
    
     (1 + 8 / (128 * 6)) * 131072 = 132437
      sysclock/128 measured ticks = 132454
    
     (1 + 8 / (256 * 6)) * 131072 = 131755
     (1 + 8 / (256 * 5)) * 131072 = 131891
      sysclock/256 measured ticks = 131766
    
     (1 + 8 / (512 * 6)) * 131072 = 131413
     (1 + 8 / (512 * 5)) * 131072 = 131482
      sysclock/512 measured ticks = 131422
    

    EDIT2: Updated equations for multiples of divider.

  • Wow, this is really a lot of information and knowledge. I don't understand all the numbers but I think this is a good summary:

    @rogloh said:
    In some ways I sort of feel we should be grateful that Chip thought carefully about this during the design stage; if he hadn't, we could have had a real performance killer with block transfers occurring during FIFO + streamer access. The way it is now, we get to keep the HUB transfer utilization high when block transfers and the streamer FIFO are both being used at the same time, which is totally awesome. We don't seem to lose an entire hub window (or 7/8ths of it) each time the streamer kicks in transferring a single item. If 6 long transfers are made each time the FIFO reads from HUB, yes, we'd lose the hub window for that time, but it now only happens 1/6 as often and potentially only two or three slots are unused at that time.

    So I don't need to worry about bandwidth bottlenecks. The single-cog Ethernet driver using the streamer for sending and smart pin serial shift registers for receiving should be possible. Not with full-duplex, full wire-speed performance, but at least a bit better than strict half duplex. Of course, using the streamer does require phase-locked clocks for the PHY and the P2.

  • evanh Posts: 16,039
    edited 2022-02-18 11:01

    @ManAtWork said:
    So I don't need to worry about bandwidth bottlenecks. The single-cog Ethernet driver using the streamer for sending and smart pin serial shift registers for receiving should be possible. Not with full-duplex, full wire-speed performance, but at least a bit better than strict half duplex. Of course, using the streamer does require phase-locked clocks for the PHY and the P2.

    Yeah, I like the sound of that. Nice to keep a driver fitting in one cog.

    At 200 MHz sysclock, a streamer transmitting 100 Mbit/s will use 1/64 of its allowable hubRAM bandwidth. (The sysclock/64 numbers above.)
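    Working that out:

        100 Mbit/s = 12.5 MByte/s = 3.125 Mlong/s
        200 MHz / 3.125 Mlong/s = 64 clocks per long  ->  1/64 of the FIFO's 1 long/clock ceiling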

  • TonyB_ Posts: 2,196
    edited 2022-02-18 11:11

    For best performance, FIFO burst length should always equal the number of cogs.

    If burst = 6 then two cycles are wasted for every fast move stall. In Evan's example, if burst = 8 then the number of stalls would be 6/8 or 75% of the burst = 6 count, i.e. 67 compared to 89.

    More important tests are streaming longs at sysclk/2, sysclk/3 and sysclk/4.

  • evanh Posts: 16,039
    edited 2022-02-18 11:22

    I've done /2 and /4 above. The summary being, everything down to sysclock/4 flowed well. Plenty of available hubRAM access.

  • TonyB_ Posts: 2,196
    edited 2022-02-18 14:50

    @evanh said:
    I've done /2 and /4 above.

    What is the fast move block size?
