HUB RAM interface question
ManAtWork
in Propeller 2
What would happen if I set up the streamer to read from hub RAM at a rate of 1 byte per 16 clock cycles and, while the streamer is running, initiated a fast and long block move with a SETQ+WRLONG combination, with a size of several hundred longs?
Will this work at all, or is there a conflict between FIFO reads and block writes? If it works, who gets priority? The block move will try to consume all the available bandwidth. The streamer should get priority so it can keep running, but I fear this would mess up the egg-beater schedule and bring the bandwidth down from 1 long/cycle to around 1 long per 9 cycles.
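A minimal PASM2 sketch of what I mean (tx_buf, dst_buf, buf and the streamer command word strm_cmd are only placeholders):

        rdfast  #0, ##tx_buf         ' point the FIFO at the hub source for the streamer
        setxfrq ##$0800_0000         ' $8000_0000 / 16 -> one streamer transfer per 16 clocks
        xinit   strm_cmd, #0         ' start a byte-wide hub-to-pins command (exact command bits per the silicon doc)
        ' ... streamer now runs in the background, pulling bytes through the FIFO ...
        setq    #400-1               ' several hundred longs, e.g. 400
        wrlong  buf, ##dst_buf       ' fast block write from cog RAM to hub while the streamer keeps going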
Comments
I posed a similar question here, somewhat off-topic:
https://forums.parallax.com/discussion/comment/1535106/#Comment_1535106
Yes, it will work, and the FIFO has priority. As 1 byte per 16 cycles is 1 long per 64 cycles, the block move should not have to yield to the FIFO very often. I don't know exactly how the FIFO refills itself; some testing of this would be very helpful, but I can't spare the time at the moment.
I hope the FIFO operates in bursts of N longs in N cycles, the same rate as fast block moves achieve when not yielding. Higher streamer speeds demand this, and I doubt there is a different mode at slower speeds.
Ah thanks, yes that makes sense. If the FIFO uses bursts to fill up then there should be enough bandwidth left.
Does it make a difference if the streamer runs byte or long reads (mode 1001 vs. 1011, for example)? I hope not. I think the FIFO should fill up with RFLONGs no matter how many bytes the streamer pulls out at once.
My understanding is FIFO hubRAM accesses are 32-bit, so always at full bandwidth. The RFxxxx instructions, or equivalent streamer ops, interact with the FIFO as needed for the op.
Which means there can be late writes to hubRAM. But then there can be way early reads too.
EDIT: I guess a closing final write can be less than 32-bit.
EDIT2: Yeah, so with the OP scenario, the FIFO will steal its single hubRAM slot but, as you've likely guessed, that will also disrupt the block write for 8 clock cycles, because the block write has to stay in address order and is forced to miss a slot.
That's an interesting mental exercise. The FIFO is the one that has the most flexibility but it's also the one with priority so that very flexibility isn't being taken advantage of ... unless it does delay until needing a whole rotation of slots ... I don't think it waits that long though.
EDIT3: I had that wrong. The FIFO pulls in a minimum of six longwords in a burst. Much better than I guessed.
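For a rough feel of what that means at the OP's rate, assuming the minimum 6-long burst:

        1 byte / 16 clocks   ->  1 long / 64 clocks drained from the FIFO
        6-long refill burst  ->  needed only about every 6 * 64 = 384 clocks
        so around 2% of the hub bandwidth gets diverted from a concurrent block move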
Well, it makes a big difference in terms of hub RAM bandwidth consumed by the streamer for byte vs. long reads, but as Evan says FIFO <-> hub RAM moves are always in longs. You should not have a problem for your example.
Oh, reading the silicon doc:
If I'm reading that right, it is actually saying at least 6 longwords at a time. And maybe more if the flow rate is high. That's a nice improvement on the single longs I was thinking it might be.
PS: Nothing in the doc about FIFO writes to hubRAM though.
EDIT: Ah, there is this little hint:
What that implies is FIFO writes do auto-flush after some time, without needing an explicit RDFAST or WRFAST reissued. Which in turn implies delayed writes are the modus operandi.
If you try this could you time how long the fast block write takes?
It has to be writing in ridged address order. Anything else is too complicated. Which implies an eight clock cycle extension on each missed slot ...
Already replied to by others, but I know this works for real, as I do exactly this in my video driver: I prepare the next scan line's worth of pixel data and write it back using large SETQ2 burst writes to HUB while the streamer + FIFO is streaming out the current scan line. The streamer is not disrupted, so it has to get priority. You can be confident it is worth coding and should work out. I guess this is for trying some single-COG Ethernet driver idea... or maybe something related to EtherCAT?
It doesn't seem to lose as much bandwidth as you'd expect. I don't have specific numbers, but when I did some calculations on pixel throughput at different resolutions ages ago, I was happy to find it seemed to lose less than 7 longs per streamer access (which is what I had feared and would have totally killed performance). After reading what evanh posted above, that makes sense now if the P2 really does do its FIFO reads in bursts of more than one long at a time.
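For reference, the write-back part of that pattern is roughly this (placeholder symbols, not the actual driver code):

        setq2   #line_longs-1        ' burst-write the prepared scan line out of LUT RAM...
        wrlong  0, ##hub_line        ' ...to hub (starting at LUT address 0) while streamer + FIFO output the current line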
Huh, unimportant, but I discovered the weirdest thing: forgetting to set up the FIFO, i.e. missing the RDFAST before an XINIT, the XINIT still consumes hubRAM slot timings. I have no idea if valid data is read though.
It probably just reads from whatever address was left in its current FIFO address register. Is the data valid (from some random hub address, so it might look random) or is it zeroes?
Without the streamer active (commented out), the measured time is 16390 ticks (16384 + 6 for instructions).
With the streamer active, the measured time is 17102 ticks. Minus 6 for instructions, that's 17096 - 16384 = 712 extra clock cycles, which is 712 / 8 = 89 stalls during the block write ... about 4 more stalls than needed if all FIFO bursts were 6 longwords.
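The shape of the measurement, roughly (N, dst, t1, t2 are placeholders, not the actual test code):

        getct   t1                   ' timestamp before
        setq    ##N-1                ' N-long block write, cog source wraps as needed
        wrlong  0, ##dst
        getct   t2                   ' timestamp after the write completes
        sub     t2, t1               ' elapsed sysclock ticks, including a few ticks of instruction overhead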
What is "ridged address order"?
How many cycles per long for streamer? 32?
Doh! Rigid rather. 'twas spelling it how I pronounce it.
8 ticks / byte -> 32 ticks / longword -> 192 ticks / burst.
Evan, you need to consider the extra time the stalls have added. 17096/32 cycles = 534.25 streamer longs. 534.25/89 = 6 as near as makes no difference.
That's burst time, not stall time.
You said
but I think it's the right number of stalls. Bedtime.
P.S. FIFO burst length should ideally be no more than 8.
In some ways I feel we should be grateful that Chip thought carefully about this during the design stage; if he hadn't, we could have had a real performance killer with block transfers occurring during FIFO + streamer access. As it is, we get to keep HUB transfer utilization high when block transfers and the streamer FIFO are both in use at the same time, which is totally awesome. We don't seem to lose an entire hub window (or 7/8ths of it) each time the streamer kicks in to transfer a single item. If 6 long transfers are made each time the FIFO reads from HUB, yes, we'd lose the hub window for that time, but it only happens 1/6 as often and potentially only two or three slots are unused at that point.
One thing this sort of means, though, is that at times the FIFO must actually defer to the SETQ block transfer request up to a point before taking over. Some type of hysteresis/watermark triggering is probably going on to cause that behaviour; otherwise the FIFO request backlog could simply never build up (when operating under 1/8th of the total HUB bandwidth) and the FIFO would kick in every time it needed another long. So I think this control logic probably ties in with the rule about filling continuously up to COGS+7 stages: below this number the FIFO has precedence, but above it, maybe SETQ does to some degree...
Okay, main points:
So, maybe I've got something wrong with my expected 85 stalls ... I worked it out as 16384 += 16384 / 24 -> 17067 ticks. The / 24 comes from 8 / 192, which is from the assumed 8 tick extension for each write stall and 192 ticks between each 6 longword FIFO burst. And 17096 - 17067 = 29 ticks too long. And I did presume those all came from extra stalls.
Absolutely, that's a good thing! Chip's done well. I've gone back and corrected my wrong assumption there.
Doh! 17067 / 192 = 88.9. Missed that. So 89 stalls is expected, and the time taken is exactly that. Okay, so the 29 excess ticks is just an error in the ratio from 16384 to 17067 I guess ... yeah, 16384 += 16384 / 23 -> 17096
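Spelling that out, assuming 8 extra ticks per stall and one 6-longword FIFO burst every 192 ticks:

        stalls = 17067 / 192 = 88.9  ->  89
        total  = 16384 + 89 * 8  =  17096 ticks, exactly the measured figure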
Chip has done well. It's one of those things that might not be immediately apparent during the design stage to regular/inexperienced people who don't consider system-level impacts carefully.
I recall a time at a networking company in Silicon Valley where a separate ASIC team was diligently trying to develop a custom chip so we wouldn't need multiple FPGAs on our flagship switch fabric board for a chassis, to save costs. We probably spent a couple of million dollars or so on tools and staff over around a year, including the initial chip fab costs, and by the time the chip finally arrived it contained a nasty flaw: IIRC it had flow control without bus speed-up, which limited transfer performance whenever the flow control was active, so it would not work at wire speed for smaller packets. This likely happened because the HW designers on the project, while good at Verilog etc., probably weren't really system people, and that team had worked in isolation from the day-to-day development team, who might have had a chance to see this coming had they known what was being designed and reviewed what was being built. Once the flaw was pointed out to the HW team and the denial passed, they scrambled to come up with a workaround...
Result: an ASIC AND 6 new FPGAs were now needed on the switch, one new FPGA per pair of line card slots to store-and-forward packets and buffer during back-pressure times. LOL
It's funny in a way: the FPGA is there to thoroughly test these things before going to ASIC.
So for some reason the actual additional factor is 8 stall ticks / (192 burst interval ticks - 8 stall ticks) ... 16384 += 16384 * 8 / (192 - 8) -> 17096 ticks.
EDIT: Nah, it's not that simple. Best guess now is a beat forms at that rate. Here's some more:
EDIT2: Updated equations for multiples of divider.
Wow, this is really a lot of information and knowledge. I don't understand all the numbers but I think this is a good summary:
So I don't need to worry about bandwidth bottlenecks. The single-cog Ethernet driver using the streamer for sending and smart pin serial shift registers for receiving should be possible. Not with full-duplex, full wire-speed performance, but at least a bit better than strict half duplex. Of course, using the streamer does require phase-locked clocks for the PHY and the P2.
Yeah, I like the sound of that. Nice to keep a driver fitting in one cog.
At 200 MHz sysclock, a streamer transmitting 100 Mbit/s will use 1/64 of its allowable hubRAM bandwidth. (The sysclock/64 numbers above.)
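Put as numbers, assuming the FIFO can sustain one long per clock when bursting:

        100 Mbit/s           =  12.5 MB/s of streamer data
        200 MHz * 4 bytes    =  800 MB/s peak FIFO <-> hub rate
        12.5 / 800           =  1/64 of the available bandwidth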
For best performance, the FIFO burst length should always equal the number of cogs.
If burst = 6 then two cycles are wasted on every fast-move stall. In Evan's example, if burst = 8 then the number of stalls would be 6/8 (75%) of the burst = 6 count, i.e. 67 compared to 89.
More important tests are streaming longs at sysclk/2, sysclk/3 and sysclk/4.
I've done /2 and /4 above. The summary being, everything down to sysclock/4 flowed well. Plenty of available hubRAM access.
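Roughly, assuming the FIFO's peak rate of one long per clock:

        sysclock/2 longs  ->  FIFO needs ~1/2 of the peak rate on average
        sysclock/4 longs  ->  FIFO needs ~1/4, leaving ~3/4 for block moves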
What is the fast move block size?